
2. Mathematics of Deep Learning

“Every number is composed of units, and every number can be divided into units” - Al-Khwarizmi (780-850), Persian mathematician

This chapter explores the mathematical concepts that form the core of deep learning. Deep learning models are composed of complex mathematical functions, so a solid grounding in linear algebra, calculus, probability, and statistics is essential to grasp model behavior, improve performance, and design new models. For example, matrix operations are crucial for understanding the behavior of Convolutional Neural Networks (CNNs), while differentiation and optimization are key to understanding how models learn.

If this chapter feels difficult, you can move on to the next one and come back to it later. It’s best to revisit and get familiar with it from time to time.

2.1 Basics of Linear Algebra

Linear algebra is the foundation of deep learning. From matrix operations to advanced optimization techniques, linear algebra is an essential tool. This section will cover basic concepts such as vectors, matrices, and tensors, as well as advanced topics like singular value decomposition and principal component analysis.

2.1.1 Vectors

Vectors and matrices are the most basic objects for representing and transforming data.

Vector Basics

A vector is a mathematical object that represents a quantity with both magnitude and direction. The mathematical definition is the same, but the perspective varies depending on the field of application.

  • Mathematical perspective: In mathematics, a vector is defined as an abstract object with magnitude and direction, which is closed under addition and scalar multiplication.
  • Physical perspective: In physics, vectors are mainly used to represent physical quantities such as force, velocity, and acceleration. In this case, the magnitude and direction of the vector have real physical meaning, and the dimension of the vector is fixed by the setting: space is 3D, spacetime is 4D.
  • Computer science perspective: In computer science, especially in machine learning and deep learning, vectors are mainly used to represent features of data. Here, each element of the vector represents a specific property of the data, and it may not necessarily have physical directionality. The dimension of the vector can range from tens to thousands.

Understanding these different perspectives is important when working with vectors in deep learning. In deep learning, vectors are mainly used from a computer science perspective, but mathematical operations and physical intuition are also utilized.

In deep learning, vectors are often used to represent multiple features of data simultaneously. For example, a 5-dimensional vector used in a housing price prediction model can be represented as:

\(\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \\ v_5 \end{bmatrix}\)

Each element of this vector represents different characteristics of the house. \(v_1\): area of the house (square meters), \(v_2\): number of rooms, \(v_3\): age of the house (years), \(v_4\): distance to the nearest school (kilometers), \(v_5\): crime rate (percentage)

Deep learning models can use these multi-dimensional vectors as input to predict housing prices. Vectors are used to effectively represent and process complex real-world data with multiple features.

In NumPy, vectors can be easily created and used.

Code
!pip install dldna[colab] # in Colab

# !pip install dldna[all] # on your local machine
Code
import numpy as np

# Vector creation
v = np.array([1, 2, 3])

# Vector magnitude (L2 norm)
magnitude = np.linalg.norm(v)
print(f"Vector magnitude: {magnitude}")

# Vector normalization
normalized_v = v / magnitude
print(f"Normalized vector: {normalized_v}")
Vector magnitude: 3.7416573867739413
Normalized vector: [0.26726124 0.53452248 0.80178373]

A closer look at the concept of vectors reveals distinctions between row vectors and column vectors, as well as covectors and contravariant vectors used in physics and engineering.

Row Vectors and Column Vectors

Vectors are generally represented as column vectors. Row vectors can be considered as the transpose of column vectors. More precisely, row vectors can also be viewed as dual vectors or covectors.

Column vector: \(\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}\), Row vector: \(\mathbf{v}^T = [v_1 \quad v_2 \quad v_3]\)

Row vectors and column vectors have different properties. Row vectors act on column vectors as linear functions to produce scalars, which is represented by the dot product operation.

\[\mathbf{u}^T\mathbf{v} = u_1v_1 + u_2v_2 + u_3v_3\]
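
In NumPy, this action of a row vector on a column vector is simply the dot product; a minimal sketch (the vector values are arbitrary):

Code
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

# u^T v: the row vector u^T acting on the column vector v yields a scalar
scalar = u @ v          # equivalent to np.dot(u, v)
print(scalar)           # 1*4 + 2*5 + 3*6 = 32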

Covectors and Contravariant Vectors

In physics and engineering, the concepts of covectors (covariant vectors) and contravariant vectors are importantly dealt with. This represents the transformation characteristics of vectors according to coordinate system changes.

  • Contravariant vector: A vector that transforms in the opposite direction of the basis when the coordinate system changes. It is generally denoted with a superscript (e.g., \(v^i\)).
  • Covector: A vector that transforms in the same direction as the basis when the coordinate system changes. It is generally denoted with a subscript (e.g., \(v_i\)).

This distinction is crucial in tensor notation. For example, \(T^i_j\) indicates that the upper index \(i\) represents contravariance and the lower index \(j\) represents covariance. Notably, in general relativity, these concepts of covariance and contravariance are treated as very important.

Application in Deep Learning

In deep learning, the distinction between covariance and contravariance is often not explicitly emphasized. The reasons for this include:

  1. Standardized data representation: Deep learning typically handles data in a standardized form (e.g., column vectors), making the distinction between covariance and contravariance less important.
  2. Euclidean space assumption: Many deep learning models assume that data lives in Euclidean space with an orthonormal basis, where covariant and contravariant components coincide.
  3. Simplification of operations: Key deep learning operations (e.g., matrix multiplication, application of activation functions) can be effectively performed without these distinctions.
  4. Automatic differentiation: Modern deep learning frameworks’ automatic differentiation capabilities can accurately compute gradients without considering these subtle distinctions.

However, in specific fields, particularly physics-based machine learning or geometric deep learning, these concepts may still be important. For instance, in deep learning models that utilize differential geometry, the distinction between covariance and contravariance can play a critical role in model design and interpretation.

In conclusion, while the basic concept of vectors in deep learning is simplified, more complex mathematical concepts still play significant roles in advanced model design and special applications.

Vector Space and Linear Combination

The vector space, a core concept in linear algebra, provides a fundamental framework for representing and transforming data in deep learning. In this deep dive, we’ll explore the rigorous definition of vector spaces and related concepts, along with examples of their applications in deep learning.

Vector Space

A vector space is a set \(V\) equipped with addition and scalar multiplication operations that satisfy eight axioms. Here, elements of \(V\) are called vectors, and scalars are real numbers \(\mathbb{R}\) or complex numbers \(\mathbb{C}\). (In deep learning, we mainly use real numbers.)

Vector Addition: For any two elements \(\mathbf{u}, \mathbf{v}\) in \(V\), \(\mathbf{u} + \mathbf{v}\) is also an element of \(V\). (Closed under addition)

Scalar Multiplication: For any element \(\mathbf{u}\) in \(V\) and scalar \(c\), \(c\mathbf{u}\) is also an element of \(V\). (Closed under scalar multiplication)

Vector Addition and Scalar Multiplication must satisfy the following eight axioms. (\(\mathbf{u}, \mathbf{v}, \mathbf{w} \in V\), \(c, d\): scalars)

  1. Commutativity of addition: \(\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}\)
  2. Associativity of addition: \((\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})\)
  3. Additive identity: There exists a \(\mathbf{0} \in V\) (zero vector) such that \(\mathbf{u} + \mathbf{0} = \mathbf{u}\) for all \(\mathbf{u} \in V\).
  4. Additive inverse: For each \(\mathbf{u} \in V\), there exists a \(-\mathbf{u} \in V\) (additive inverse) such that \(\mathbf{u} + (-\mathbf{u}) = \mathbf{0}\).
  5. Distributivity of scalar multiplication with respect to vector addition: \(c(\mathbf{u} + \mathbf{v}) = c\mathbf{u} + c\mathbf{v}\)
  6. Distributivity of scalar multiplication with respect to scalar addition: \((c + d)\mathbf{u} = c\mathbf{u} + d\mathbf{u}\)
  7. Compatibility of scalar multiplication with scalar multiplication: \(c(d\mathbf{u}) = (cd)\mathbf{u}\)
  8. Identity element of scalar multiplication: \(1\mathbf{u} = \mathbf{u}\) (here, 1 is the identity element of scalar multiplication)

Examples:

  • \(\mathbb{R}^n\): \(n\)-dimensional real vector space (n-tuples of real numbers)
  • \(\mathbb{C}^n\): \(n\)-dimensional complex vector space
  • \(M_{m \times n}(\mathbb{R})\): \(m \times n\) real matrix space
  • \(P_n\): space of polynomials with real coefficients of degree at most \(n\)
  • \(C[a, b]\): space of real-valued continuous functions on the interval \([a, b]\)

Subspace

A subset \(W\) of a vector space \(V\) is called a subspace if it satisfies the following conditions:

  1. \(\mathbf{0} \in W\) (contains the zero vector)
  2. If \(\mathbf{u}, \mathbf{v} \in W\), then \(\mathbf{u} + \mathbf{v} \in W\) (closed under addition)
  3. If \(\mathbf{u} \in W\) and \(c\) is a scalar, then \(c\mathbf{u} \in W\) (closed under scalar multiplication)

In other words, a subspace is a subset of a vector space that is itself a vector space.

Linear Combination

Given vectors \(\mathbf{v}_1, \mathbf{v}_2, ..., \mathbf{v}_k\) in a vector space \(V\) and scalars \(c_1, c_2, ..., c_k\), an expression of the form \(c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + ... + c_k\mathbf{v}_k\) is called a linear combination.

Linear Independence and Linear Dependence

A set of vectors {\(\mathbf{v}_1, \mathbf{v}_2, ..., \mathbf{v}_k\)} is said to be linearly independent if the only way to express the zero vector as a linear combination of these vectors is with all coefficients equal to zero:

\(c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + ... + c_k\mathbf{v}_k = \mathbf{0}\) implies \(c_1 = c_2 = ... = c_k = 0\)

If this condition is not satisfied (i.e., there exist non-zero scalars \(c_1, ..., c_k\) such that the equation holds), the set of vectors is said to be linearly dependent.

Intuitive meaning:

  • Linear Independence: Vectors can be thought of as pointing in “different directions”. No vector can be expressed as a linear combination of the others.
  • Linear Dependence: Some vectors can be thought of as lying in the “same direction” (or “plane”, “hyperplane”). At least one vector can be expressed as a linear combination of the others.
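
One practical numerical test of linear independence is to stack the vectors as columns of a matrix and compare its rank to the number of vectors; a minimal NumPy sketch (the example vectors are arbitrary):

Code
import numpy as np

def is_linearly_independent(vectors):
    """Return True if the given 1-D vectors are linearly independent."""
    A = np.column_stack(vectors)          # vectors become the columns of A
    return np.linalg.matrix_rank(A) == len(vectors)

v1, v2, v3 = np.array([1, 0, 0]), np.array([0, 1, 0]), np.array([1, 1, 0])
print(is_linearly_independent([v1, v2]))      # True: different directions
print(is_linearly_independent([v1, v2, v3]))  # False: v3 = v1 + v2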

Basis and Dimension

  • Basis: A basis for a vector space \(V\) is a set of vectors that satisfies two conditions:
    1. Linear independence.
    2. Spans \(V\) (see span below).
  • Dimension: The number of vectors in a basis for a vector space is called the dimension of the space. (dim \(V\))

Key point: While a basis for a given vector space is not unique, all bases have the same number of vectors.

Span

The span of a set of vectors {\(\mathbf{v}_1, \mathbf{v}_2, ..., \mathbf{v}_k\)} is the set of all possible linear combinations of these vectors:

span{\(\mathbf{v}_1, \mathbf{v}_2, ..., \mathbf{v}_k\)} = {\(c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + ... + c_k\mathbf{v}_k\) | \(c_1, c_2, ..., c_k\) are scalars}

In other words, it is the set of all vectors that can be formed using the given vectors. The span is always a subspace.

Vector Space Examples in Deep Learning

  • Feature Vector: Input data for deep learning models, such as images, text, and audio, are often represented as high-dimensional vectors. For example, a 28x28 pixel grayscale image can be represented as a 784-dimensional vector. Each dimension represents the brightness value of a specific pixel in the image.
  • Weight Vector: Each layer of a neural network consists of a weight matrix and a bias vector. Each row (or column) of the weight matrix can be viewed as a vector representing the weights of a particular neuron.
  • Embedding Vector: Used to represent words, users, items, etc. in a low-dimensional vector space. Word2Vec, GloVe, and BERT are representative embedding techniques that represent words as vectors.
  • Latent Space: Autoencoders, Variational Autoencoders (VAE), and Generative Adversarial Networks (GAN) learn to map data to a low-dimensional latent space. This latent space can also be viewed as a vector space.

Norms and Distances in Deep Learning

Measuring the magnitude of vectors or the distance between two vectors is crucial in deep learning. It is utilized in various areas, including loss functions, regularization, and similarity measurements.

Norms

The Lp-norm of a vector \(\mathbf{x} = [x_1, x_2, ..., x_n]\) is defined as follows (\(p \ge 1\)).

\(||\mathbf{x}||_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}\)

  • L1-norm (\(p=1\)): \(||\mathbf{x}||_1 = \sum_{i=1}^{n} |x_i|\) (Manhattan distance, Taxicab norm)
    • Characteristics: The sum of the absolute values of each element. Useful when the magnitude of each feature in the vector is important.
    • Deep Learning Application: L1 regularization (Lasso regression) is used to create sparse models (some weights are 0) by limiting the sum of the absolute values of the weights.
  • L2-norm (\(p=2\)): \(||\mathbf{x}||_2 = \sqrt{\sum_{i=1}^{n} x_i^2}\) (Euclidean norm)
    • Characteristics: The straight-line distance from the origin to the vector coordinates (Pythagorean theorem). The most widely used norm.
    • Deep Learning Application: L2 regularization (Ridge regression) is used to prevent overfitting by limiting the sum of the squares of the weights. Also known as weight decay.
  • L∞-norm (\(p \to \infty\)): \(||\mathbf{x}||_\infty = \max_i |x_i|\)
    • Characteristics: The maximum absolute value among the vector elements.
    • Deep Learning Application: (Less common) Used when you want to limit the value of a specific element from becoming too large.
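
All three norms above can be computed with np.linalg.norm by varying the ord argument; a quick check on an arbitrary example vector:

Code
import numpy as np

x = np.array([3.0, -4.0, 1.0])

print(np.linalg.norm(x, ord=1))       # L1: |3| + |-4| + |1| = 8.0
print(np.linalg.norm(x, ord=2))       # L2: sqrt(9 + 16 + 1) ≈ 5.099
print(np.linalg.norm(x, ord=np.inf))  # L∞: max absolute value = 4.0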

Distances

The distance between two vectors \(\mathbf{x}\) and \(\mathbf{y}\) is generally defined as the norm of their difference.

\(d(\mathbf{x}, \mathbf{y}) = ||\mathbf{x} - \mathbf{y}||\)

  • L1 Distance: \(d(\mathbf{x}, \mathbf{y}) = ||\mathbf{x} - \mathbf{y}||_1 = \sum_{i=1}^{n} |x_i - y_i|\)
  • L2 Distance: \(d(\mathbf{x}, \mathbf{y}) = ||\mathbf{x} - \mathbf{y}||_2 = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}\)

Examples of Applications in Deep Learning:

  • Loss Functions: MSE (L2 Loss), MAE (L1 Loss)
  • Regularization: L1 regularization, L2 regularization (weight decay)
  • Similarity/Distance-based Learning: k-NN, SVM, Siamese Network, Triplet Network, Contrastive Learning
  • Embeddings: Representing words, users, items, etc. in vector space and understanding similarity/relevance through vector distances.
  • Outlier Detection: Detecting outliers based on the distance between data points.

Note: In deep learning, it is essential to distinguish between “distance” and “similarity”: a smaller distance corresponds to higher similarity, so the two are inversely related ways of measuring closeness. Cosine similarity is one of the most commonly used similarity measures in deep learning, as the sketch below shows.
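
A minimal sketch contrasting the two notions (the vectors are arbitrary): cosine similarity compares directions and ignores magnitude, while L2 distance compares positions.

Code
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude

cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
l2_distance = np.linalg.norm(a - b)

print(cosine_sim)   # 1.0: identical direction, maximal similarity
print(l2_distance)  # ≈ 3.742: nonzero distance despite perfect cosine similarity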

Affine Space

Affine space is a generalization of the vector space concept in linear algebra, and it is a useful tool for understanding deep learning models from a geometric perspective. In particular, affine transformations represent a form of linear transformation with bias added, which is frequently used in deep learning.

Definition of Affine Space

An affine space consists of three elements: a set of points, a vector space, and an operation that adds a point and a vector. More specifically:

  • Set of points (\(\mathcal{A}\)): A collection of geometric objects (points). Unlike vector spaces, there is no fixed origin.
  • Vector space (\(V\)): A set of vectors representing the displacement or difference between points. It satisfies all properties of a vector space (addition, scalar multiplication, and eight axioms).
  • Addition of point and vector (\(\mathcal{A} \times V \to \mathcal{A}\)): An operation that adds a point \(P \in \mathcal{A}\) and a vector \(\mathbf{v} \in V\) to obtain a new point \(Q \in \mathcal{A}\). This is denoted as \(Q = P + \mathbf{v}\).

This addition operation must satisfy two properties:

  1. For any point \(P \in \mathcal{A}\), \(P + \mathbf{0} = P\) (where \(\mathbf{0}\) is the zero vector in the vector space \(V\))
  2. For any points \(P, Q, R \in \mathcal{A}\), \((P + \mathbf{u}) + \mathbf{v} = P + (\mathbf{u} + \mathbf{v})\) (where \(\mathbf{u}, \mathbf{v} \in V\))

Important Characteristics

  • There is no special point called the “origin” in an affine space. All points are equal.
  • The difference between two points \(P, Q \in \mathcal{A}\) can be represented as a vector in the vector space \(V\): \(\overrightarrow{PQ} = Q - P \in V\). However, the sum of two points is not defined.
  • Using point and vector addition, it is possible to move from one point to another.

Affine Combination

Given points \(P_1, P_2, ..., P_k\) in an affine space \(\mathcal{A}\) and scalars \(c_1, c_2, ..., c_k\), the following expression is called an affine combination:

\(c_1P_1 + c_2P_2 + ... + c_kP_k\) (provided that \(c_1 + c_2 + ... + c_k = 1\))

Important: Unlike linear combinations, affine combinations require that the sum of the coefficients is 1. This condition reflects the property of affine spaces that there is no “origin”.

Affine Transformation

An affine transformation is a function from one affine space to another, expressed as a combination of a linear transformation and a translation. That is, an affine transformation includes both a linear transformation and a bias:

\(f(P) = T(P) + \mathbf{b}\)

  • \(T\): Linear transformation (from vector space \(V\) to \(V\))
  • \(\mathbf{b}\): Translation vector (an element of vector space \(V\))

Matrix Representation:

Affine transformations can be represented using augmented matrices. In an \(n\)-dimensional affine space, using \((n+1)\)-dimensional vectors allows the representation of affine transformations as \((n+1) \times (n+1)\) matrices.

\(\begin{bmatrix} \mathbf{y} \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{A} & \mathbf{b} \\ \mathbf{0}^T & 1 \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}\)

  • \(\mathbf{A}\): \(n \times n\) linear transformation matrix
  • \(\mathbf{b}\): \(n\)-dimensional translation vector
  • \(\mathbf{x}\): \(n\)-dimensional input vector (point in affine space)
  • \(\mathbf{y}\): \(n\)-dimensional output vector (point in affine space)
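
A minimal NumPy sketch of this augmented-matrix form in 2D, assuming an arbitrary rotation angle and translation chosen for illustration:

Code
import numpy as np

theta = np.pi / 4                       # 45-degree rotation (arbitrary choice)
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # linear part
b = np.array([2.0, 1.0])                # translation part (arbitrary choice)

# Augmented (n+1) x (n+1) matrix: [[A, b], [0^T, 1]]
M = np.eye(3)
M[:2, :2] = A
M[:2, 2] = b

x = np.array([1.0, 0.0])
x_aug = np.append(x, 1.0)               # homogeneous coordinates [x; 1]

y_aug = M @ x_aug
print(y_aug[:2])                        # same result as the affine map below
print(A @ x + b)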

Affine Space and Affine Transformation in Deep Learning

  • Fully Connected Layer: The fully connected layer (dense layer) in deep learning performs an affine transformation. In \(\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}\), \(\mathbf{W}\mathbf{x}\) represents the linear transformation and \(\mathbf{b}\) represents the bias (translation).
  • Input Space: Input data in deep learning models is often represented as high-dimensional vectors, but can be considered as points in an affine space without an origin. For example, image data is represented as a vector with pixel brightness values as elements, but this vector space does not have a special origin.
  • Feature Space: Each layer of the neural network transforms the input into a new feature space. This transformation is often a combination of affine transformations (linear transformation + bias) and non-linear activation functions.
  • Data Augmentation: Operations such as rotation, translation, and scaling on image data can be represented as affine transformations.
  • Affine Layer: Unlike a purely linear transformation, an affine layer includes a bias term.

Deep Learning Models Without Bias

Recently, some deep learning research has proposed models that remove the bias term for computational efficiency, model interpretability, or specific theoretical backgrounds.

  • DeepMind’s MuZero (2020): The reinforcement learning model MuZero does not use bias in its policy network and value network. The paper mentions that removing bias helps with representation learning.
  • OpenAI’s GPT (Generative Pre-trained Transformer) series: In some studies and implementations, the bias term is removed for computational efficiency. However, not every GPT-series model omits bias; large models such as GPT-3 often still use it.
  • No-Bias Networks: Some research systematically analyzes the effect of removing bias on the model’s generalization performance.

Reasons for Removing Bias

  • Computational Efficiency: Removing the bias term reduces the number of model parameters, resulting in decreased computational and memory usage. This effect can be more significant in large models.
  • Representation Learning: In certain problems, the bias term may be unnecessary or even hinder representation learning. For example, the MuZero authors argue that bias-free representations can learn more abstract and generalized representations.
  • Theoretical/Mathematical Grounds: Some models (e.g., certain types of generative models) may have a form without bias that is mathematically more natural or suitable for specific theoretical analyses.
  • Regularization Effect: There is also research suggesting that the absence of bias can have a regularization effect, allowing the weight matrix to carry more important information.

Note: Removing bias does not always guarantee improved performance. The impact of bias on performance may vary depending on the characteristics of the problem, the structure of the model, and the amount of data.

The concept of affine space and affine transformation can be used for geometric interpretation of deep learning models, analysis of generalization performance, and design of new architectures.

2.1.2 Dimensions, Ranks

Terms related to tensors, vectors, and matrices are used slightly differently in mathematics, physics, and computer science, which can cause confusion. To avoid this confusion, let’s clarify the key concepts. First, we’ll look at the rank and dimensions of a tensor. The rank of a tensor refers to the number of indices it has. For example, a scalar is a rank 0 tensor, a vector is a rank 1 tensor, and a matrix is a rank 2 tensor. Tensors with three or more dimensions are generally just called tensors.

The term “dimension” can have two different meanings, so caution is needed. First, it can be used to mean the same as the rank of a tensor. In this case, a vector would be called a one-dimensional tensor and a matrix would be called a two-dimensional tensor. Second, it can be used to refer to the length or size of an array. For example, the dimension of a vector \(\mathbf{a} = [1, 2, 3, 4]\) would be said to be 4.

It’s also important to know the differences in terminology between fields. In physics, the number of elements has physical meaning, so the terms are used more strictly. On the other hand, in computer science, vectors, matrices, and tensors are often treated as arrays of numbers, and the term “dimension” is used interchangeably to refer to both the number of data elements and the number of indices.

To avoid confusion due to these differences in terminology, several things need to be kept in mind. The meaning of a term can vary depending on the context, so careful interpretation is necessary. It’s necessary to clearly distinguish what is meant by “dimension” in papers or books. In particular, in the field of deep learning, both the rank of a tensor and the size of an array are often referred to as “dimensions”, so consistent interpretation is important.

In deep learning frameworks, the term ‘dimension’ or ‘axis’ is used to represent the shape of a tensor. For example, in PyTorch, you can check the size of each dimension of a tensor using tensor.shape or tensor.size(). In this book, ‘dimension’ will refer to the rank of a tensor, while the length or size along each axis will be referred to as the value of the corresponding element of the shape.
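
A short PyTorch sketch of both usages (the tensor shape below is an arbitrary example; NumPy behaves the same way with .ndim and .shape):

Code
import torch

t = torch.randn(32, 3, 28, 28)   # e.g., a batch of 32 RGB 28x28 images

print(t.ndim)      # 4 -> "dimension" as rank: the number of axes
print(t.shape)     # torch.Size([32, 3, 28, 28]) -> size along each axis
print(t.size(0))   # 32 -> length along axis 0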

2.1.3 Basics of Linear Transformations

Let’s take a look at the math needed for deep learning training. The linear transformation, which is the core operation of neural networks, is expressed very simply in forward calculations. In this section, we will focus on the basic linear operations before passing through the activation function.

The basic form of forward operations is as follows.

\[\boldsymbol y = \boldsymbol x \boldsymbol W + \boldsymbol b\]

Here, \(\boldsymbol x\) represents the input, \(\boldsymbol W\) represents the weight, \(\boldsymbol b\) represents the bias, and \(\boldsymbol y\) represents the output. In neural network mathematics, inputs and outputs are often represented as vectors, and weights are represented as matrices. Bias (\(\boldsymbol b\)) is sometimes expressed as a scalar value, but it should be accurately represented as a vector of the same form as the output.

Matrices and Linear Transformations

Matrices are powerful tools for expressing linear transformations. Linear transformations are processes that map a point in vector space to another point, which can be seen as a transformation of the entire space. For materials that help visualize these concepts, I recommend 3Blue1Brown’s video “Linear transformations and matrices”[1]. This video intuitively explains basic concepts of linear algebra and clearly shows how matrices transform spaces.

When input data \(\boldsymbol x\) is represented as a vector, it means a single data point, and the length of the vector becomes the number of features. However, in actual training processes, multiple data are usually processed at once. In this case, the input becomes a matrix \(\boldsymbol X\) of the form (n, m), where n represents the number of data and m represents the number of features.
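A minimal NumPy sketch of this batched form, with shapes annotated (the sizes n, m, and the output size k are arbitrary illustration choices):

Code
import numpy as np

n, m, k = 4, 3, 2                # batch size, input features, output features
X = np.random.randn(n, m)        # (n, m): n data points, m features each
W = np.random.randn(m, k)        # (m, k): weight matrix
b = np.random.randn(k)           # (k,):   bias, broadcast over the batch

Y = X @ W + b                    # (n, k): one output row per data point
print(Y.shape)                   # (4, 2)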

In actual deep learning models, input data can take the form of tensors with dimensions higher than 2D matrices.

  • Image data: 4D tensor of the form (batch size, height, width, channel)
  • Video data: 5D tensor of the form (batch size, frame number, height, width, channel)

To handle such high-dimensional data, neural networks use various forms of linear and nonlinear transformations. In the reverse propagation process of linear transformations, gradients are calculated and transmitted in reverse order to each layer to update parameters. This process can be complex, but it is efficiently performed using automatic differentiation tools. Linear transformation is a basic component of deep learning models, but the actual performance of models is achieved through combinations with nonlinear activation functions. In the next section, we will look at how this nonlinearity increases the expressive power of models.

Code
# If in Colab, please don't run this and the code below; just watch the result video below the following cell.
# from manim import *
Code
%%manim -qh -v WARNING LinearTransformations  
from manim import *
from manim import config

class LinearTransformations(ThreeDScene):
    def construct(self):
        self.set_camera_orientation(phi=75 * DEGREES, theta=-45 * DEGREES)
        axes = ThreeDAxes(x_range=[-6, 6, 1], y_range=[-6, 6, 1], z_range=[-6, 6, 1], x_length=10, y_length=10, z_length=10).set_color(GRAY)
        self.add(axes)

        # --- 3D Linear Transformation (Rotation and Shear) ---
        title = Text("3D Linear Transformations", color=BLACK).to_edge(UP)
        self.play(Write(title))
        self.wait(1)

        # 1. Rotation around Z-axis
        text_rotation = Text("Rotation around Z-axis", color=BLUE).scale(0.7).next_to(title, DOWN, buff=0.5)
        self.play(Write(text_rotation))
        cube = Cube(side_length=2, fill_color=BLUE, fill_opacity=0.5, stroke_color=WHITE, stroke_width=1)
        self.play(Create(cube))
        self.play(Rotate(cube, angle=PI/2, axis=OUT, about_point=ORIGIN), run_time=2)
        self.wait(1)
        self.play(FadeOut(text_rotation))

        # 2. Shear
        text_shear = Text("Shear Transformation", color=GREEN).scale(0.7).next_to(title, DOWN, buff=0.5)
        self.play(Write(text_shear))

        # Define the shear transformation matrix.  This shears in x relative to y, and in y relative to x.
        shear_matrix = np.array([
            [1, 0.5, 0],
            [0.5, 1, 0],
            [0, 0, 1]
        ])

        self.play(
            cube.animate.apply_matrix(shear_matrix),
            run_time=2,
        )
        self.wait(1)

        # Add transformed axes to visualize the shear
        transformed_axes = axes.copy().apply_matrix(shear_matrix)
        self.play(Create(transformed_axes), run_time=1)
        self.wait(1)

        self.play(FadeOut(cube), FadeOut(transformed_axes), FadeOut(text_shear))

        # --- 2D to 3D Transformation (Paraboloid) ---
        text_2d_to_3d = Text("2D to 3D: Paraboloid", color=MAROON).scale(0.7).next_to(title, DOWN, buff=0.5)
        self.play(Write(text_2d_to_3d))

        square = Square(side_length=4, fill_color=MAROON, fill_opacity=0.5, stroke_color=WHITE, stroke_width=1)
        self.play(Create(square))

        def paraboloid(point):  # Function for the transformation
            x, y, _ = point
            return [x, y, 0.2 * (x**2 + y**2)]  # Adjust scaling factor (0.2) as needed

        paraboloid_surface = always_redraw(lambda: Surface(
            lambda u, v: axes.c2p(*paraboloid(axes.p2c(np.array([u,v,0])))),
            u_range=[-2, 2],
            v_range=[-2, 2],
            resolution=(15, 15), # Added for smoothness
            fill_color=MAROON,
            fill_opacity=0.7,
            stroke_color=WHITE,
            stroke_width=0.5

        ).set_shade_in_3d(True))


        self.play(
            Transform(square, paraboloid_surface),
            run_time=3,
        )

        self.wait(2)
        self.play(FadeOut(square), FadeOut(text_2d_to_3d))

        # --- 3D to 2D Transformation (Projection) ---
        text_3d_to_2d = Text("3D to 2D: Projection", color=PURPLE).scale(0.7).next_to(title, DOWN, buff=0.5)
        self.play(Write(text_3d_to_2d))

        sphere = Sphere(radius=1.5, fill_color=PURPLE, fill_opacity=0.7, stroke_color=WHITE, stroke_width=1, resolution=(20,20)).set_shade_in_3d(True)
        self.play(Create(sphere))


        def project_to_2d(mob, alpha):
            for p in mob.points:
                p[2] *= (1-alpha)

        self.play(
            UpdateFromAlphaFunc(sphere, project_to_2d),
            run_time=2
        )

        self.wait(1)

        # Show a circle representing the final projection 
        circle = Circle(radius=1.5, color=PURPLE, fill_opacity=0.7, stroke_color = WHITE, stroke_width=1)
        self.add(circle)
        self.wait(1)

        self.play(FadeOut(sphere), FadeOut(text_3d_to_2d), FadeOut(circle), FadeOut(title))

        self.wait(1)

import logging
logging.getLogger("manim").setLevel(logging.WARNING)

if __name__ == "__main__":
    config.video_dir = "./"
    scene = LinearTransformations()
    scene.render()

2.1.4 Tensor Operations

Challenge: How can we efficiently represent and operate on multidimensional data?

Researcher’s Dilemma: In the early days of deep learning, researchers had to deal with various forms of data such as images, text, and audio. This data was difficult to express using simple vectors or matrices, and a method to effectively process complex data structures was needed. Additionally, efficient operation methods for processing large amounts of data quickly were also important tasks.

Tensors are the fundamental mathematical objects used to represent data and model parameters in deep learning. They can be thought of as multidimensional arrays, generalizing scalars, vectors, and matrices. Tensors are classified according to their dimension (rank) as follows:

  • 0-dimensional tensor: scalar (e.g., 3.14)
  • 1-dimensional tensor: vector (e.g., [1, 2, 3])
  • 2-dimensional tensor: matrix (e.g., [[1, 2], [3, 4]])
  • 3-dimensional or higher: high-dimensional tensor

In deep learning, we mainly deal with the following types of tensors:

  • Input Data:
    • General: (batch size, number of features)
    • Time series/Text: (batch size, sequence length, number of features/embedding dimension)
    • Image: (batch size, height, width, channels) in channels-last frameworks; PyTorch uses the channels-first form (batch size, channels, height, width)
  • Weights:
    • Fully-connected: (number of input features, number of output features)
    • Convolutional: (number of output channels, number of input channels, kernel height, kernel width)
  • Output Data (Output/Prediction):
    • Classification: (batch size, number of classes)
    • Regression: (batch size, output dimension)
  • Bias:
    • Fully connected: (number of output features,)
    • Convolutional: (number of output channels,)
  • Feature maps (Outputs of Convolutional layers): (batch size, number of output channels, height, width)

The basic linear transformation of a neural network is as follows:

\(y_j = \sum\limits_{i} x_i w_{ij} + b_j\)

where \(i\) is the index of the input and \(j\) is the index of the output. This can be expressed in vector and matrix form as follows:

\(\boldsymbol x = \begin{bmatrix}x_{1} & x_{2} & \cdots & x_{i} \end{bmatrix}\)

\(\boldsymbol W = \begin{bmatrix} w_{11} & \cdots & w_{1j} \\ \vdots & \ddots & \vdots \\ w_{i1} & \cdots & w_{ij} \end{bmatrix}\)

\(\boldsymbol b = \begin{bmatrix}b_{1} & b_{2} & \cdots & b_{j} \end{bmatrix}\)

\(\boldsymbol y = \boldsymbol x \boldsymbol W + \boldsymbol b\)

The main features of tensor operations are as follows (a NumPy sketch of all three follows this list):

  1. Broadcasting: enables operations between tensors of different sizes.

  2. Dimension reduction: reduces specific dimensions of a tensor using operations such as sum() and mean().

  3. Reshaping: changes the shape of a tensor to transform it into a tensor with different dimensions.
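
A quick NumPy sketch of all three features (the array values are arbitrary):

Code
import numpy as np

T = np.arange(6).reshape(2, 3)      # reshaping: 1-D range -> (2, 3) matrix

# Broadcasting: the (3,) row vector is stretched across both rows of T
row = np.array([10, 20, 30])
print(T + row)                      # shape (2, 3)

# Dimension reduction: aggregating along an axis collapses that axis
print(T.sum(axis=0))                # shape (3,)
print(T.mean(axis=1))               # shape (2,)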

One of the most important operations in neural network learning is gradient calculation. The main gradient calculations are as follows:

  1. Gradient with respect to input: \(\frac{\partial \boldsymbol y}{\partial \boldsymbol{x}}\)

  2. Gradient with respect to weights: \(\frac{\partial \boldsymbol y}{\partial \boldsymbol W}\)

These gradients represent the change in output with respect to changes in input and weights, and are the core of backpropagation algorithms. Tensor operations form the basis of modern deep learning, enabling efficient learning and inference of large models through highly parallel processing using GPUs. Additionally, automatic differentiation of tensor operations enables efficient gradient computation, which has become a major breakthrough in modern deep learning research. This goes beyond simple numerical computations, making the structure and learning process of models themselves programmable targets. We will look at practical examples of tensor operations in more detail in Chapter 3 on PyTorch.

2.1.5 Singular Value Decomposition and Principal Component Analysis

Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) are powerful mathematical tools used to reduce the dimensionality of high-dimensional data and extract the main features inherent in the data.

Singular Value Decomposition (SVD)

SVD is a method of decomposing any \(m \times n\) matrix \(\mathbf{A}\) into the product of three matrices as follows:

\(\mathbf{A} = \mathbf{U\Sigma V^T}\)

where,

  • \(\mathbf{U}\): an \(m \times m\) orthogonal matrix (left singular vectors)
  • \(\mathbf{\Sigma}\): an \(m \times n\) diagonal matrix (singular values)
  • \(\mathbf{V}\): an \(n \times n\) orthogonal matrix (right singular vectors)

Key Idea:

  • Singular Values: The diagonal elements of \(\mathbf{\Sigma}\) (\(\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0\)) are the singular values of matrix \(\mathbf{A}\), representing the degree of data variation in the corresponding direction. Large singular values represent important features of the data, while small singular values represent noise or less important information.
  • Dimensionality Reduction: By keeping only the \(k\) largest singular values and their corresponding singular vectors, we approximate matrix \(\mathbf{A}\) with a rank-\(k\) matrix (\(\mathbf{A} \approx \mathbf{U}_k \mathbf{\Sigma}_k \mathbf{V}_k^T\)), reducing storage from \(mn\) values to \(k(m+n+1)\) while preserving the main features of the original data.

Application in Deep Learning:

  • Model Compression: Applying SVD to the weight matrix of a neural network and approximating it with a low-dimensional matrix can reduce the model size and improve inference speed. This is particularly effective for reducing the embedding matrix size in Transformer-based language models (e.g., BERT).
  • Recommendation System: SVD can be used to extract latent factors between users and items.
    • Latent Factors: By decomposing user and item matrices using SVD, we can represent users and items in a low-dimensional space.
      • User’s latent factor: represents the user’s hidden preferences (e.g., movie enthusiast, likes action movies, likes romantic comedies, etc.).
      • Item’s latent factor: represents the item’s hidden features (e.g., blockbuster movie, starring actor A, happy ending, etc.).
    • Low-Dimensional Representation: SVD allows us to approximate the originally large user-item matrix with a low-dimensional matrix product.
    • Recommendation: We can calculate the similarity between users and items in the low-dimensional space or predict the probability that a user will prefer a particular item by computing the inner product.
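
A minimal sketch of this rank-\(k\) approximation using NumPy's np.linalg.svd (the matrix size and \(k\) are arbitrary illustration choices):

Code
import numpy as np

A = np.random.randn(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s: singular values, descending

k = 2                                  # keep only the 2 largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Relative approximation error shrinks as k grows
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))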

Principal Component Analysis (PCA)

PCA is a method of finding the direction (principal component) that maximizes the variance of the data and projecting the data onto a lower-dimensional space. It is closely related to SVD and finds the principal components through eigenvalue decomposition of the data’s covariance matrix.

PCA Steps:

  1. Data Centering: Make the average of each feature 0.
  2. Covariance Matrix Calculation: Calculate the covariance matrix that represents the correlation between features.
  3. Eigenvalue Decomposition: Calculate the eigenvalues and eigenvectors of the covariance matrix.
     • Eigenvector: direction of a principal component
     • Eigenvalue: size of the variance in the direction of that principal component
  4. Principal Component Selection: Select the \(k\) eigenvectors corresponding to the largest eigenvalues. (Reduces data to \(k\) dimensions.)
  5. Data Projection: Project the data onto the selected \(k\) principal components to reduce dimensions.

Application in Deep Learning:

  • Data Preprocessing: By projecting high-dimensional data such as images and text into a low-dimensional space, it can be used as input for deep learning models, reducing computational cost and preventing overfitting. Especially in image classification problems, representing high-resolution images in low dimensions through PCA can speed up model training.
  • Feature Extraction: The principal components extracted by PCA can be interpreted as new features that are uncorrelated with each other and preserve the maximum variance of the data.

SVD vs. PCA

  • SVD is a matrix decomposition technique, while PCA is a data dimensionality reduction technique.
  • PCA can be implemented using SVD. (The SVD of the data matrix is related to the eigendecomposition of the covariance matrix)
  • PCA requires a preprocessing step to center the data mean at 0, but SVD can be applied directly without this process.
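
To make the SVD-PCA connection concrete, here is a minimal from-scratch sketch on synthetic 2D data reduced to 1D (the mixing matrix and random seed are arbitrary illustration choices):

Code
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated 2D data

X_centered = X - X.mean(axis=0)            # step 1: center each feature
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt[:1]                        # step 4: top principal direction
X_proj = X_centered @ components.T         # step 5: project to 1D

explained = s[0]**2 / np.sum(s**2)         # variance ratio captured by PC1
print(f"Explained variance ratio: {explained:.4f}")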

SVD and PCA are mathematical tools that play an important role in efficiently representing data and improving model performance in deep learning.

Code
from dldna.chapter_02.pca import visualize_pca
visualize_pca()

Explained variance ratio: 0.5705

This example demonstrates the ability of PCA to project complex 2D structures into 1D. For spiral data, a single principal component cannot capture all variability, but it can capture the major trend of the data. The explained variance ratio can be used to evaluate how well this 1D representation preserves the structure of the original data.

In deep learning, these techniques serve three main purposes:

  1. Data preprocessing: dimensionality reduction of input data
  2. Model compression: efficient approximation of weight matrices
  3. Feature extraction: identification and selection of important features

SVD and PCA are powerful tools for extracting important patterns from high-dimensional data and simplifying complex data structures.

2.2 Calculus and Optimization

2.2.1 Chain Rule

Challenge: How can we efficiently compute the derivative of a complex nested function?

Researcher’s Concern: Early deep learning researchers had to use backpropagation algorithms to update neural network weights. However, since neural networks are structures with multiple layers of functions connected in a complicated way, calculating the derivative of the loss function for each weight was a very difficult problem. In particular, as the layers deepened, the amount of computation increased exponentially, making learning inefficient.

The most important calculus rule used in deep learning is the chain rule. The chain rule is a powerful and elegant rule that allows us to express the derivative of a composite function as the product of the derivatives of the constituent functions. Visualizing the chain rule can make it easier to understand. For example, let’s assume that \(z\) is a function of \(x\) and \(y\), and \(x\) and \(y\) are functions of \(s\) and \(t\), respectively. This relationship can be represented as a tree diagram.

(Figure: tree diagram of the chain rule)

In this diagram, the partial derivative of \(z\) with respect to \(s\), \(\frac{\partial z}{\partial s}\), is the sum of the products of the partial derivatives along all paths from \(z\) to \(s\).

\(\frac{\partial z}{\partial s} = \frac{\partial z}{\partial x} \frac{\partial x}{\partial s} + \frac{\partial z}{\partial y} \frac{\partial y}{\partial s}\)

In this formula,

  • \(\frac{\partial z}{\partial x}\) and \(\frac{\partial z}{\partial y}\) represent how \(z\) changes with respect to \(x\) and \(y\).
  • \(\frac{\partial x}{\partial s}\) and \(\frac{\partial y}{\partial s}\) represent how \(x\) and \(y\) change with respect to \(s\).

Let’s consider another case where the chain rule is used to express a total derivative. Suppose \(z\) depends on variables that are each driven by mutually independent parameters. In this case, the chain rule simplifies to the form of a total derivative. For example, if \(z = f(x, y)\) with \(x = g(s)\) and \(y = h(t)\), where \(s\) and \(t\) are independent, then the total derivative of \(z\) can be expressed as follows.

(Figure: the chain rule in the total derivative case)

\(dz = \frac{\partial z}{\partial x}dx + \frac{\partial z}{\partial y}dy\)

Here, \(dx = \frac{\partial x}{\partial s}ds\) and \(dy = \frac{\partial y}{\partial t}dt\), so we finally get the following form.

\(dz = \frac{\partial z}{\partial x}\frac{\partial x}{\partial s}ds + \frac{\partial z}{\partial y}\frac{\partial y}{\partial t}dt\)

This equation looks similar to the chain rule, but it actually represents a total derivative. The key point is that since \(s\) and \(t\) are independent, \(\frac{\partial x}{\partial t}\) and \(\frac{\partial y}{\partial s}\) are 0. A total derivative represents the total effect of changes in all independent variables on the function value and can be expressed as the sum of partial derivatives for each variable. The chain rule’s structure allows the derivative of a complex function to be broken down into simpler components, which is especially important in deep learning, where neural networks are composed of multiple layers of functions. Using tree diagrams, the chain rule can be applied even in more complicated situations: find all paths from the dependent variable to the independent variables, multiply the partial derivatives along each path, and sum these products.
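
As a minimal numerical check, assume the simple compositions \(x = st\) and \(y = s + t\) (chosen for illustration); PyTorch's autograd applies exactly this path-summing rule:

Code
import torch

s = torch.tensor(2.0, requires_grad=True)
t = torch.tensor(3.0, requires_grad=True)

x = s * t          # x = st
y = s + t          # y = s + t
z = x**2 + y       # z = x^2 + y

z.backward()       # autograd applies the chain rule along all paths

# Hand-computed: dz/ds = (dz/dx)(dx/ds) + (dz/dy)(dy/ds) = 2x*t + 1
print(s.grad)                # tensor(37.)
print(2 * (2 * 3) * 3 + 1)   # 37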

The chain rule is the mathematical foundation for backpropagation algorithms in deep learning, enabling efficient updates of weights in complex neural network models.

2.2.2 Gradient and Jacobian

Challenge: How can we generalize the derivative for functions with various forms of input and output?

Researcher’s Concerns: Early deep learning primarily dealt with scalar functions, but it gradually had to handle functions with vectors, matrices, and other forms of input and output. Expressing and calculating the derivatives of these functions in a unified manner was an essential task for developing deep learning frameworks.

In deep learning, we deal with functions that have various forms of input (scalars, vectors, matrices, tensors) and output (scalars, vectors, matrices, tensors). Accordingly, the expression of the function’s derivative also changes. The key is to consistently express these various cases of derivatives and apply the chain rule to calculate them efficiently.

Key Concepts

  • Gradient: An expression used when differentiating a scalar function with respect to a vector. It is a column vector containing the partial derivatives of the function with respect to each element of the input vector, representing the direction of the steepest ascent.
  • Jacobian Matrix: An expression used when differentiating a vector function with respect to a vector. It is a matrix whose elements are the partial derivatives of each element of the output vector with respect to each element of the input vector.

Derivative Expressions for Various Input and Output Forms

| Input Form | Output Form | Derivative Expression | Dimension |
| --- | --- | --- | --- |
| Vector (\(\mathbf{x}\)) | Vector (\(\mathbf{f}\)) | Jacobian matrix (\(\mathbf{J} = \frac{\partial \mathbf{f}}{\partial \mathbf{x}}\)) | \(n \times m\) |
| Matrix (\(\mathbf{X}\)) | Vector (\(\mathbf{f}\)) | 3D tensor (generally not handled directly) | - |
| Vector (\(\mathbf{x}\)) | Matrix (\(\mathbf{F}\)) | 3D tensor (generally not handled directly) | - |
| Scalar (\(x\)) | Vector (\(\mathbf{f}\)) | Column vector (\(\frac{\partial \mathbf{f}}{\partial x}\)) | \(n \times 1\) |
| Vector (\(\mathbf{x}\)) | Scalar (\(f\)) | Gradient (\(\nabla f = \frac{\partial f}{\partial \mathbf{x}}\)) | \(m \times 1\) (column vector) |
| Matrix (\(\mathbf{X}\)) | Scalar (\(f\)) | Matrix (\(\frac{\partial f}{\partial \mathbf{X}}\)) | \(m \times n\) |

Note:

  • \(m\): dimension of the input vector, \(n\): dimension of the output vector; for a matrix input \(\mathbf{X}\), \(m \times n\) denotes its number of rows and columns
  • In the case of matrix input and vector/matrix output, the derivative becomes a 3D tensor. Deep learning frameworks efficiently handle such high-dimensional tensor operations internally, but generally, Jacobian/gradient calculations for vector/matrix inputs and outputs are predominant.
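
PyTorch's autograd can build this matrix directly; a minimal sketch using torch.autograd.functional.jacobian (the example function is an arbitrary choice):

Code
import torch
from torch.autograd.functional import jacobian

def f(x):
    # f: R^3 -> R^2
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)
print(J)   # shape (2, 3): rows index outputs, columns index inputs
# Expected: [[x1, x0, 0], [0, 1, 2*x2]] = [[2, 1, 0], [0, 1, 6]]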

Application in Deep Learning

  • Backpropagation Algorithm: The Jacobian matrix and gradient play a crucial role when implementing the backpropagation algorithm in deep learning. As the neural network passes through each layer, it applies the chain rule to calculate the gradient of the loss function with respect to the weights, and updates the weights accordingly.
  • Automatic Differentiation: Modern deep learning frameworks (TensorFlow, PyTorch, etc.) provide automatic differentiation features that automatically handle these complex differentiation calculations. Users do not need to implement complex differentiation formulas directly; they only need to define the model structure and loss function.

Thus, the concepts of gradients and Jacobian matrices are essential tools for generalizing derivatives of various forms of functions in deep learning and efficiently training models through backpropagation.

Hessian Matrix

1. Definition and Meaning of the Hessian Matrix

  • Definition: The Hessian matrix is a square matrix of second partial derivatives of a scalar-valued function. Given a function \(f(x_1, x_2, ..., x_n)\), the Hessian matrix \(H\) is defined as follows:

    \[ H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \]

    • Each element represents the second partial derivative of the function with respect to each variable.
    • Symmetric Matrix: If the function has continuous second partial derivatives, the Hessian matrix is symmetric because the order of partial differentiation can be swapped (Schwarz’s theorem). (\(\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}\))
  • Meaning:

    • Curvature: The Hessian matrix contains local curvature information about the function. It describes how much the graph of the function is curved at a specific point.
    • Rate of Change of the Rate of Change: While the first derivative (gradient) represents the rate of change of the function, the Hessian matrix represents how quickly this rate of change changes.

2. Hessian Matrix and Critical Point Determination

  • Critical Point: A point where the gradient of the function is zero. In other words, it’s a point where all first partial derivatives with respect to each variable are zero. (\(\nabla f = 0\))
  • Determination of Extrema:
    • The Hessian matrix is used at critical points to determine whether the function has a local maximum, minimum, or saddle point.
    • Local Minimum: If the Hessian matrix is positive definite (all eigenvalues are positive), then the critical point is a local minimum.
    • Local Maximum: If the Hessian matrix is negative definite (all eigenvalues are negative), then the critical point is a local maximum.
    • Saddle Point: If the Hessian matrix is indefinite (has both positive and negative eigenvalues), then the critical point is a saddle point.
    • Semi-definite: If the Hessian is positive or negative semi-definite (at least one eigenvalue is zero), additional information is needed to determine the type of critical point.

3. Application of the Hessian Matrix in Deep Learning

  • Newton’s Method:
    • It is one of the optimization algorithms for finding the extreme value of a function.
    • While the gradient descent method uses the first derivative (gradient), Newton’s method uses the second derivative (Hessian) to converge faster.
    • Update rule: \(x_{k+1} = x_k - H^{-1}(x_k) \nabla f(x_k)\) (H is the Hessian matrix)
  • Curvature Matrix:
    • The Hessian matrix can be used as a curvature matrix representing the curvature of the loss function.
    • It utilizes curvature information to adjust the learning rate or improve the performance of optimization algorithms. (e.g., Natural Gradient Descent)
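
A minimal sketch of Newton's method on a simple quadratic function chosen for illustration, using its analytic gradient and Hessian:

Code
import numpy as np

# f(x) = (x0 - 1)^2 + 10 * (x1 + 2)^2, with its minimum at (1, -2)
def grad(x):
    return np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])

def hess(x):
    return np.array([[2.0, 0.0], [0.0, 20.0]])

x = np.array([5.0, 5.0])
for k in range(3):
    # Newton update: x_{k+1} = x_k - H^{-1}(x_k) grad f(x_k)
    x = x - np.linalg.solve(hess(x), grad(x))
    print(k, x)
# For a quadratic function, Newton's method reaches the exact minimum in one step.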

2.2.3 Chain Rule and Backpropagation in Neural Networks

The core of neural network learning is the backpropagation algorithm. Backpropagation is an efficient method that updates the weights and biases of each layer by propagating errors from the output layer to the input layer. In this process, the chain rule allows for the calculation of complex composite functions by expressing their derivatives as products of simpler derivatives.

Applying the Chain Rule in Neural Networks

A neural network can be seen as a composition of multiple layers of functions. For example, a two-layer neural network can be expressed as follows:

\(\mathbf{z} = f_1(\mathbf{x}; \mathbf{W_1}, \mathbf{b_1})\)

\(\mathbf{y} = f_2(\mathbf{z}; \mathbf{W_2}, \mathbf{b_2})\)

Here, \(\mathbf{x}\) is the input, \(\mathbf{z}\) is the output of the first layer (input to the second layer), \(\mathbf{y}\) is the final output, and \(\mathbf{W_1}\), \(\mathbf{b_1}\) are the weights and biases of the first layer, while \(\mathbf{W_2}\), \(\mathbf{b_2}\) are those of the second layer.

During backpropagation, we need to calculate the gradients of the loss function \(E\) with respect to each parameter (\(\frac{\partial E}{\partial \mathbf{W_1}}\), \(\frac{\partial E}{\partial \mathbf{b_1}}\), \(\frac{\partial E}{\partial \mathbf{W_2}}\), \(\frac{\partial E}{\partial \mathbf{b_2}}\)). Applying the chain rule, we can compute these as follows:

\(\frac{\partial E}{\partial \mathbf{W_2}} = \frac{\partial E}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{W_2}}\)

\(\frac{\partial E}{\partial \mathbf{b_2}} = \frac{\partial E}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{b_2}}\)

\(\frac{\partial E}{\partial \mathbf{W_1}} = \frac{\partial E}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{W_1}}\)

\(\frac{\partial E}{\partial \mathbf{b_1}} = \frac{\partial E}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{b_1}}\)

By using the chain rule, we can efficiently calculate the gradients of each parameter in a complex neural network as products of sequential derivatives. Section 2.2.4 provides a detailed theoretical dive into this process.

Gradients and Directional Derivatives

  • Gradient: A vector composed of the partial derivatives of a multivariable function with respect to each variable. It represents the direction of the steepest ascent.
  • Directional Derivative: Represents the rate of change of a function in a specific direction. It can be calculated as the dot product of the gradient and the direction vector.

Notes on Gradient Representation

  • Column Vector vs. Row Vector: Typically, vectors are expressed as column vectors by convention, but in deep learning, they can be expressed as row vectors depending on the context. Consistency is important. (This book uses the numerator layout convention.)
  • Jacobian Matrix: For a function (vector function) with multiple input variables and multiple output variables, it is a matrix containing all partial derivative values. It is used for backpropagation calculations in deep learning.

Based on these concepts, the next section will take a closer look at the method of calculating gradients in the backpropagation process with specific examples.

2.2.4 Gradient Calculation for Backpropagation

The core of backpropagation is to calculate the gradient of the loss function and update the weights. Let’s take a simple linear transformation (\(\mathbf{y} = \mathbf{xW} + \mathbf{b}\)) as an example to illustrate the backpropagation process.

1. Core Idea of Backpropagation

Backpropagation is an algorithm that updates the weights by propagating the error calculated in the output layer towards the input layer, adjusting each weight according to its contribution to the error. The key step in this process is calculating the gradient of the loss function with respect to each weight.

2. Gradient of the Loss Function

If we use the mean squared error (MSE) as the loss function, the gradient of the loss function \(E\) with respect to the output \(\mathbf{y}\) is as follows:

\(E = \frac{1}{M} \sum_{i=1}^{M} (y_i - \hat{y}_i)^2\)

\(\frac{\partial E}{\partial \mathbf{y}} = \frac{2}{M}(\mathbf{y} - \hat{\mathbf{y}})\)

Here, \(y_i\) is the model’s predicted value (output), \(\hat{y}_i\) is the actual (target) value, and \(M\) is the number of data points.

3. Gradient with Respect to Weights

We can calculate the gradient of the loss function \(E\) with respect to the weights \(\mathbf{W}\) by applying the chain rule.

\(\frac{\partial E}{\partial \mathbf{W}} = \frac{\partial E}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{W}}\)

Since \(\mathbf{y} = \mathbf{xW} + \mathbf{b}\), we have \(\frac{\partial \mathbf{y}}{\partial \mathbf{W}} = \mathbf{x}^T\).

Ultimately, the gradient with respect to the weights is expressed as:

\(\frac{\partial E}{\partial \mathbf{W}} = \mathbf{x}^T \frac{\partial E}{\partial \mathbf{y}}\)

4. Gradient with Respect to Input

The gradient of the loss function \(E\) with respect to the input \(\mathbf{x}\) is used to propagate the error to the previous layer.

\(\frac{\partial E}{\partial \mathbf{x}} = \frac{\partial E}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\)

Since \(\mathbf{y} = \mathbf{xW} + \mathbf{b}\), we have \(\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \mathbf{W}^T\).

Therefore, the gradient with respect to the input is:

\(\frac{\partial E}{\partial \mathbf{x}} = \frac{\partial E}{\partial \mathbf{y}} \mathbf{W}^T\)

5. Summary

Backpropagation is carried out through the following key steps.

  1. Forward Propagation: Input data \(\mathbf{x}\) is passed through the neural network to compute the output \(\mathbf{y}\).
  2. Loss Calculation: The output \(\mathbf{y}\) is compared with the target \(\hat{\mathbf{y}}\) to compute the loss \(E\).
  3. Backward Propagation:
    • The gradient of the loss with respect to the output, \(\frac{\partial E}{\partial \mathbf{y}}\), is computed at the output layer.
    • Using the chain rule, the gradient of the loss with respect to the weights, \(\frac{\partial E}{\partial \mathbf{W}} = \mathbf{x}^T \frac{\partial E}{\partial \mathbf{y}}\), is computed.
    • The gradient with respect to the input, \(\frac{\partial E}{\partial \mathbf{x}} = \frac{\partial E}{\partial \mathbf{y}} \mathbf{W}^T\), is computed, propagating the error to the previous layer.
  4. Weight Update: The computed gradients are used to update the weights with optimization algorithms such as gradient descent.

The backpropagation algorithm is the core of deep learning model training, allowing for effective approximation of complex nonlinear functions.
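
To make these steps concrete, here is a minimal NumPy sketch of one training step for the linear layer \(\mathbf{y} = \mathbf{xW} + \mathbf{b}\) with an MSE loss. All shapes and values are hypothetical; note that for a batch of inputs the bias gradient sums \(\frac{\partial E}{\partial \mathbf{y}}\) over the batch dimension (for a single sample it reduces to \(\frac{\partial E}{\partial \mathbf{y}}\) exactly, as derived below):

Code
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: batch of 4 samples, 3 input features, 2 outputs
x = rng.normal(size=(4, 3))          # input
W = rng.normal(size=(3, 2))          # weight matrix
b = np.zeros(2)                      # bias
y_target = rng.normal(size=(4, 2))   # targets (y-hat in the text)

# 1. Forward propagation: y = xW + b
y = x @ W + b

# 2. Loss calculation: MSE over all M output values
M = y.size
E = np.sum((y - y_target) ** 2) / M

# 3. Backward propagation
dE_dy = 2.0 / M * (y - y_target)     # dE/dy
dE_dW = x.T @ dE_dy                  # dE/dW = x^T dE/dy
dE_db = dE_dy.sum(axis=0)            # dE/db: dE/dy, summed over the batch
dE_dx = dE_dy @ W.T                  # dE/dx = dE/dy W^T, sent to the previous layer

# 4. Weight update (gradient descent)
lr = 0.1
W -= lr * dE_dW
b -= lr * dE_db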

Now let us work through the same linear transformation (\(\mathbf{y} = \mathbf{xW} + \mathbf{b}\)) again, this time deriving each gradient element by element.

Gradient of the Loss Function

The goal of neural network learning is to minimize the loss function \(E\). If we use the mean squared error (MSE) as the loss function, it can be expressed as follows:

\(E = f(\mathbf{y}) = \frac{1}{M} \sum_{i=1}^{M} (y_i - \hat{y}_i)^2\)

where \(y_i\) is the model's output, \(\hat{y}_i\) is the target (actual) value, and \(M\) is the number of data points (or the dimension of the output vector).

The derivative of \(E\) with respect to \(\mathbf{y}\) is as follows:

\(\frac{\partial E}{\partial \mathbf{y}} = \frac{2}{M} (\mathbf{y} - \hat{\mathbf{y}})\)

where \(\mathbf{y}\) is the output vector of the neural network, and \(\hat{\mathbf{y}}\) is the actual value (target) vector. Since each \(\hat{y}_i\) is a constant (an element of the target), only the partial derivative with respect to \(\mathbf{y}\) remains.

Note: In the example code in Chapter 1, the loss was defined with the opposite sign, so the factor was \(-\frac{2}{M}\). Here we use the standard definition of MSE, so the factor is \(+\frac{2}{M}\). In actual training, the absolute size of this constant is unimportant, since it is absorbed into the learning rate.

Gradient of the Loss Function with Respect to Weights

Now, let’s calculate the gradient of the loss function \(E\) with respect to the weights \(\mathbf{W}\). \(E = f(\mathbf{y})\) and \(\mathbf{y} = \mathbf{xW} + \mathbf{b}\). \(\mathbf{x}\) is the input vector, \(\mathbf{W}\) is the weight matrix, and \(\mathbf{b}\) is the bias vector.

Computational Graph:

To visually represent the backpropagation process, we can use a computational graph. (Insert computational graph figure)

\(E\) is a scalar value, and for each \(w_{ij}\) (each element of the weight matrix \(\mathbf{W}\)), we need to find the partial derivative of \(E\). \(\mathbf{W}\) is a matrix of size (input dimension) x (output dimension). For example, if the input is 3-dimensional (\(x_1, x_2, x_3\)) and the output is 2-dimensional (\(y_1, y_2\)), then \(\mathbf{W}\) is a 3x2 matrix.

\(\frac{\partial E}{\partial \mathbf{W}} = \begin{bmatrix} \frac{\partial E}{\partial w_{11}} & \frac{\partial E}{\partial w_{12}} \\ \frac{\partial E}{\partial w_{21}} & \frac{\partial E}{\partial w_{22}} \\ \frac{\partial E}{\partial w_{31}} & \frac{\partial E}{\partial w_{32}} \end{bmatrix}\)

The derivative of \(E\) with respect to \(\mathbf{y}\) can be expressed as a row vector: \(\frac{\partial E}{\partial \mathbf{y}} = \begin{bmatrix} \frac{\partial E}{\partial y_1} & \frac{\partial E}{\partial y_2} \end{bmatrix}\). (Using numerator notation). Strictly speaking, the gradient should be expressed as a column vector, but here we use a row vector for convenience of calculation.

By the chain rule, \(\frac{\partial E}{\partial \mathbf{W}} = \frac{\partial E}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{W}}\)

\(\frac{\partial E}{\partial w_{ij}} = \sum_k \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial w_{ij}}\) (here \(k\) is the index of the output vector \(\mathbf{y}\))

Expanding the equation,

\(\frac{\partial E}{\partial \mathbf{W}} = \frac{\partial E}{\partial y_1} \frac{\partial y_1}{\partial \mathbf{W}} + \frac{\partial E}{\partial y_2} \frac{\partial y_2}{\partial \mathbf{W}}\)

Now, we need to calculate \(\frac{\partial y_k}{\partial w_{ij}}\). Since \(\mathbf{y} = \mathbf{xW} + \mathbf{b}\),

\(y_1 = x_1w_{11} + x_2w_{21} + x_3w_{31} + b_1\)

\(y_2 = x_1w_{12} + x_2w_{22} + x_3w_{32} + b_2\)

\(\frac{\partial y_1}{\partial \mathbf{W}} = \begin{bmatrix} \frac{\partial y_1}{\partial w_{11}} & \frac{\partial y_1}{\partial w_{12}} \\ \frac{\partial y_1}{\partial w_{21}} & \frac{\partial y_1}{\partial w_{22}} \\ \frac{\partial y_1}{\partial w_{31}} & \frac{\partial y_1}{\partial w_{32}} \end{bmatrix} = \begin{bmatrix} x_1 & 0 \\ x_2 & 0 \\ x_3 & 0 \end{bmatrix}\)

\(\frac{\partial y_2}{\partial \mathbf{W}} = \begin{bmatrix} 0 & x_1 \\ 0 & x_2 \\ 0 & x_3 \end{bmatrix}\)

Therefore,

\(\frac{\partial E}{\partial \mathbf{W}} = \frac{\partial E}{\partial y_1} \begin{bmatrix} x_1 & 0 \\ x_2 & 0 \\ x_3 & 0 \end{bmatrix} + \frac{\partial E}{\partial y_2} \begin{bmatrix} 0 & x_1 \\ 0 & x_2 \\ 0 & x_3 \end{bmatrix} = \begin{bmatrix} \frac{\partial E}{\partial y_1}x_1 & \frac{\partial E}{\partial y_2}x_1 \\ \frac{\partial E}{\partial y_1}x_2 & \frac{\partial E}{\partial y_2}x_2 \\ \frac{\partial E}{\partial y_1}x_3 & \frac{\partial E}{\partial y_2}x_3 \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \begin{bmatrix} \frac{\partial E}{\partial y_1} & \frac{\partial E}{\partial y_2} \end{bmatrix} = \mathbf{x}^T \frac{\partial E}{\partial \mathbf{y}}\)

Generalization:

If the input is a \(1 \times m\) row vector \(\mathbf{x}\) and the output is a \(1 \times n\) row vector \(\mathbf{y}\), then the weight \(\mathbf{W}\) becomes an \(m \times n\) matrix. In this case, \(\frac{\partial E}{\partial \mathbf{W}} = \mathbf{x}^T \frac{\partial E}{\partial \mathbf{y}}\)
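
The result \(\frac{\partial E}{\partial \mathbf{W}} = \mathbf{x}^T \frac{\partial E}{\partial \mathbf{y}}\) is easy to verify numerically. The sketch below (with hypothetical values, and shapes matching the worked example) compares the analytic gradient against central finite differences:

Code
import numpy as np

rng = np.random.default_rng(1)

# One sample: x is 1x3, W is 3x2, as in the worked example above
x = rng.normal(size=(1, 3))
W = rng.normal(size=(3, 2))
b = np.zeros(2)
y_target = rng.normal(size=(1, 2))

def loss(W):
    y = x @ W + b
    return np.sum((y - y_target) ** 2) / y.size

# Analytic gradient: dE/dW = x^T dE/dy
y = x @ W + b
dE_dy = 2.0 / y.size * (y - y_target)
analytic = x.T @ dE_dy

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # differences should be tiny (~1e-9 or smaller)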

Gradient of the Loss Function with Respect to Input

The gradient of the loss function \(E\) with respect to the input \(\mathbf{x}\) can also be calculated using the chain rule.

\(\frac{\partial E}{\partial \mathbf{x}} = \frac{\partial E}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\)

Since \(\mathbf{y} = \mathbf{xW} + \mathbf{b}\), we have \(\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \mathbf{W}^T\).

Therefore,

\(\frac{\partial E}{\partial \mathbf{x}} = \frac{\partial E}{\partial \mathbf{y}} \mathbf{W}^T\)

Gradient with Respect to Bias

The gradient of the loss function with respect to the bias \(\mathbf{b}\) is as follows.

\(\frac{\partial E}{\partial \mathbf{b}} = \frac{\partial E}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{b}}\)

Since \(\mathbf{y} = \mathbf{xW} + \mathbf{b}\), each output component depends on the corresponding bias component alone, with \(\frac{\partial y_j}{\partial b_j} = 1\); the Jacobian \(\frac{\partial \mathbf{y}}{\partial \mathbf{b}}\) is therefore the \(n \times n\) identity matrix, and

\(\frac{\partial E}{\partial \mathbf{b}} = \frac{\partial E}{\partial \mathbf{y}}\)

Summary and Additional Explanation

  1. Gradient with Respect to Weights: \(\frac{\partial E}{\partial \mathbf{W}} = \mathbf{x}^T \frac{\partial E}{\partial \mathbf{y}}\)
    • Calculated as the matrix product of the transpose of the input vector \(\mathbf{x}\) and the gradient of the loss function with respect to the output (\(\frac{\partial E}{\partial \mathbf{y}}\), represented here as a row vector).
  2. Gradient with Respect to Input: \(\frac{\partial E}{\partial \mathbf{x}} = \frac{\partial E}{\partial \mathbf{y}} \mathbf{W}^T\)
    • Calculated as the matrix product of the gradient of the loss function with respect to the output (\(\frac{\partial E}{\partial \mathbf{y}}\)) and the transpose of the weight matrix \(\mathbf{W}\). This result is backpropagated to the previous layer and used for updating the weights of that layer.
  3. Gradient with Respect to Bias: \(\frac{\partial E}{\partial \mathbf{b}} = \frac{\partial E}{\partial \mathbf{y}}\)
    • Equal to the gradient of the loss function with respect to the output.
  4. Utilization of Gradients: These calculated gradients are used in optimization algorithms like gradient descent to update the weights and biases. Each parameter is updated in the direction opposite to its gradient, so as to decrease the loss function.
  5. Notation: In the description above, we used the numerator layout to calculate the gradients. The denominator layout would yield the same update rule in the end; what matters is using a consistent notation. This book uses the numerator layout.

Through such mathematical processes, deep learning models can learn complex nonlinear transformations from input data to output data.

2.3 Probability and Statistics

Deep learning is deeply rooted in probability and statistical theory to handle the uncertainty of data. In this chapter, we will explore core concepts such as probability distributions, expectations, Bayes’ theorem, and maximum likelihood estimation. These concepts are essential for understanding the learning and inference processes of models.

2.3.1 Probability Distributions and Expectations

Challenge: How can we mathematically model the uncertainty of real data?

Researcher’s Concern: Early machine learning researchers recognized that real-world data cannot be explained by deterministic rules. Data contains measurement errors, noise, and unpredictable variability. Mathematical tools were needed to quantify this uncertainty and reflect it in models.

A probability distribution represents all possible outcomes and their occurrence probabilities. It can be divided into discrete and continuous probability distributions.

Discrete Probability Distributions

Discrete probability distributions deal with cases where the values that a random variable can take are finite or countable. The characteristic is that a clear probability can be assigned to each possible outcome.

Mathematically, a discrete probability distribution is represented by a probability mass function (PMF).

\[P(X = x) = p(x)\]

where \(p(x)\) is the probability that \(X\) takes the value \(x\). The main properties are as follows:

  1. For all \(x\), \(0 \le p(x) \le 1\)
  2. \(\sum_{x} p(x) = 1\)

Representative examples include the Bernoulli distribution, binomial distribution, and Poisson distribution.

The probability mass function for rolling a die is as follows:

\[P(X = x) = \begin{cases} \frac{1}{6} & \text{if } x \in \{1, 2, 3, 4, 5, 6\} \\ 0 & \text{otherwise} \end{cases}\]

Discrete probability distributions are used in various fields such as classification problems, reinforcement learning, and natural language processing in machine learning and deep learning. The following is the result of simulating a die roll.

Code
from dldna.chapter_02.statistics import simulate_dice_roll

simulate_dice_roll()

Continuous Probability Distribution

A continuous probability distribution deals with the case where the random variable can take on continuous values. Unlike discrete probability distributions, the probability at a specific point is 0, and we deal with probabilities of intervals. Mathematically, a continuous probability distribution is represented by a probability density function (PDF).

\[f(x) = \lim_{\Delta x \to 0} \frac{P(x < X \leq x + \Delta x)}{\Delta x}\]

Here, \(f(x)\) represents the probability density near \(x\). The main properties are as follows:

  1. For all \(x\), \(f(x) \ge 0\)
  2. \(\int_{-\infty}^{\infty} f(x) dx = 1\)

Representative examples include the normal distribution, exponential distribution, and gamma distribution.

The probability density function of the normal distribution is as follows.

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]

Here, \(\mu\) is the mean and \(\sigma\) is the standard deviation.

Continuous probability distributions are importantly used in various machine learning and deep learning application fields such as regression problems, signal processing, and time series analysis.

Code
from dldna.chapter_02.statistics import plot_normal_distribution

plot_normal_distribution()

Expected Value

The expected value is an important concept that represents the central tendency of a probability distribution. It can be interpreted as the weighted average of all possible values of a random variable. For discrete probability distributions, the expected value is calculated as follows:

\[E[X] = \sum_{i} x_i P(X = x_i)\]

where \(x_i\) is a possible value of the random variable X, and \(P(X = x_i)\) is the probability of that value. For continuous probability distributions, the expected value is calculated through integration.

\[E[X] = \int_{-\infty}^{\infty} x f(x) dx\]

where \(f(x)\) is the probability density function. The expected value has the following important properties:

  1. Linearity: \(E[aX + b] = aE[X] + b\)
  2. Expectation of the product of independent random variables: \(E[XY] = E[X]E[Y]\) (when X and Y are independent)

In deep learning, the expected value is crucial for minimizing loss functions or estimating model parameters. For example, the mean squared error (MSE) is defined as:

\[MSE = E[(Y - \hat{Y})^2]\]

where \(Y\) is the actual value, and \(\hat{Y}\) is the predicted value.

The concept of expected value provides a theoretical basis for optimization algorithms such as stochastic gradient descent and is also essential for estimating value functions in reinforcement learning.

Code
from dldna.chapter_02.statistics import calculate_dice_expected_value

calculate_dice_expected_value()
Expected value of dice roll: 3.5

These fundamental concepts of probability and statistics play a crucial role in the design, learning, and evaluation process of deep learning models. In the next section, we will look at Bayes’ theorem and maximum likelihood estimation based on this.

2.3.2 Bayes’ Theorem and Maximum Likelihood Estimation

Challenge: How can we best estimate a model’s parameters with limited data?

Researcher’s Dilemma: Early statisticians and machine learning researchers often faced the challenge of creating models with limited data. Accurately estimating a model’s parameters with insufficient data was a daunting task. Instead of relying solely on the data, they needed a method to incorporate prior knowledge or beliefs to improve the accuracy of their estimates.

Bayes’ theorem and maximum likelihood estimation are core concepts in probability theory and statistics, widely applied in deep learning for model training and inference.

Bayes’ Theorem

Bayes’ theorem provides a way to calculate conditional probabilities. It is used to update the probability of a hypothesis when new evidence is given. The mathematical expression of Bayes’ theorem is as follows:

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]

where:

  • \(P(A|B)\) is the probability of A given B (posterior probability)
  • \(P(B|A)\) is the probability of B given A (likelihood)
  • \(P(A)\) is the probability of A (prior probability)
  • \(P(B)\) is the probability of B (evidence)

Bayes’ theorem is utilized in machine learning as follows:

  1. Classification problems: calculating the probability of belonging to a specific class in naive Bayes classifiers.
  2. Parameter estimation: calculating the posterior distribution of model parameters.
  3. Decision theory: making optimal decisions under uncertainty.
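
As a concrete illustration of the theorem, the following sketch applies it to a classic diagnostic-test scenario (all numbers are hypothetical): the evidence \(P(B)\) is computed by the law of total probability, and the posterior shows how a small prior keeps the result modest even for an accurate test.

Code
# Hypothetical numbers: a condition with 1% prevalence, a test with
# 99% sensitivity and a 5% false-positive rate.
p_A = 0.01                 # prior P(A): having the condition
p_B_given_A = 0.99         # likelihood P(B|A): positive test given the condition
p_B_given_not_A = 0.05     # P(B|not A): false-positive rate

# Evidence P(B), by the law of total probability
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Posterior P(A|B) via Bayes' theorem
p_A_given_B = p_B_given_A * p_A / p_B
print(f"P(A|B) = {p_A_given_B:.3f}")   # about 0.167: most positives are false alarms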

Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is a method for finding the model parameters that best explain the given data. In the context of deep learning, this means finding the weights and biases of a neural network that best describe the observed data. In other words, MLE finds the parameters that maximize the probability of the model generating the training data, which is directly related to the learning process of the model.

Mathematically, given data \(X = (x_1, ..., x_n)\), the likelihood function for parameter \(\theta\) is defined as:

\[L(\theta|X) = P(X|\theta) = \prod_{i=1}^n P(x_i|\theta)\]

The maximum likelihood estimate \(\hat{\theta}_{MLE}\) is found by:

\[\hat{\theta}_{MLE} = \operatorname{argmax}_{\theta} L(\theta|X)\]

In practice, it is more convenient to maximize the log-likelihood:

\[\hat{\theta}_{MLE} = \operatorname{argmax}_{\theta} \log L(\theta|X) = \operatorname{argmax}_{\theta} \sum_{i=1}^n \log P(x_i|\theta)\]

Using log-likelihood has several important mathematical advantages:

  1. Conversion of multiplication to addition: due to the property of logarithms, \(\log(ab) = \log(a) + \log(b)\), the product of probabilities can be converted into a sum of log-probabilities, simplifying calculations and improving numerical stability.
  2. Improved numerical stability: when dealing with very small probability values, multiplication can lead to underflow. Using logs avoids this problem.
  3. Simplification of differentiation: in optimization processes where derivatives are calculated, using log functions simplifies the computations, especially for exponential distributions.
  4. Monotonic increase: the log function is monotonically increasing, meaning that maximizing the likelihood and maximizing the log-likelihood yield the same result.

For these reasons, many machine learning algorithms, including those in deep learning, use log-likelihood for optimization.

Maximum likelihood estimation is applied in deep learning as follows:

  1. Parameter Estimation: MLE is used to estimate model parameters by maximizing the likelihood of observing the training data given those parameters.
  2. Model Selection: By comparing the log-likelihood values of different models on a test set, one can select the model that best balances complexity and fit to the data.
  3. Density Estimation: MLE is used in density estimation techniques such as Gaussian Mixture Models (GMMs) to estimate the parameters of the underlying distributions that generate the observed data.
  4. Model Learning: Minimizing the loss function when learning the weights of a neural network is essentially the same as maximum likelihood estimation.
  5. Probabilistic Modeling: MLE is used to estimate the distribution of data in generative models.
  6. Hyperparameter Tuning: It can be used to select the hyperparameters of a model.

Bayes’ theorem and maximum likelihood estimation are closely related. In Bayes estimation, if the prior probability is a uniform distribution, the maximum a posteriori (MAP) estimation is equivalent to maximum likelihood estimation. Mathematically, when \(P(\theta)\) is constant in \(P(\theta|X) \propto P(X|\theta)P(\theta)\), \(\operatorname{argmax}_{\theta} P(\theta|X) = \operatorname{argmax}_{\theta} P(X|\theta)P(\theta)\) holds. This means that when the prior probability does not provide additional information about the parameter, the estimate based on data only (MLE) is consistent with Bayes estimation (MAP).

These concepts are essential for understanding and optimizing the learning and inference processes of deep learning models. In the next section, we will explore the basics of information theory.

Bayes’ Theorem - In-Depth Analysis

1. Rigorous Derivation of Bayes’ Theorem and Probability Space

  • Probability Space: Bayes’ theorem is defined on a probability space \((\Omega, \mathcal{F}, P)\).
    • \(\Omega\): Sample space (set of all possible outcomes)
    • \(\mathcal{F}\): Event space (set of subsets of the sample space, \(\sigma\)-algebra)
    • \(P\): Probability measure (function assigning probability to each event in the event space)
  • Rigorous Definition of Conditional Probability:
    • For an event \(B \in \mathcal{F}\) with \(P(B) > 0\), the conditional probability of an event \(A \in \mathcal{F}\) given \(B\) is defined as: \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)
  • Joint Probability:
    • The joint probability of two events \(A, B \in \mathcal{F}\), denoted by \(P(A \cap B)\), represents the probability that both events occur simultaneously.
    • Using the definition of conditional probability, it can be expressed as:
      • \(P(A \cap B) = P(A|B)P(B)\)
      • \(P(A \cap B) = P(B|A)P(A)\)
  • Derivation of Bayes’ Theorem:
    1. \(P(A|B)P(B) = P(B|A)P(A)\) (two expressions for joint probability)
    2. \(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\) (dividing both sides by \(P(B)\), given \(P(B) > 0\))

2. In-Depth Meaning and Statistical Interpretation of Each Term in Bayes’ Theorem

  • \(P(A|B)\): Posterior Probability
    • Interpretation: The updated probability distribution of hypothesis \(A\) after observing data \(B\). It represents the result of inference based on the data.
    • Bayesian Perspective: The posterior probability quantifies uncertainty by combining prior and likelihood, providing a basis for decision-making.
  • \(P(B|A)\): Likelihood
    • Interpretation: The probability of observing data \(B\) given that hypothesis \(A\) is true. It measures how well hypothesis \(A\) explains the data.
    • Frequentist vs. Bayesian Perspective:
      • Frequentist: Likelihood is a function of fixed parameters, explaining the distribution of the data.
      • Bayesian: Likelihood is a function providing information about the parameter given the data.
  • \(P(A)\): Prior Probability
    • Interpretation: The probability distribution representing prior belief in hypothesis \(A\) before observing data \(B\).
    • Subjective vs. Objective Prior:
      • Subjective: Set based on expert knowledge, previous experience, etc.
      • Objective: Uses a uniform distribution or non-informative prior, containing minimal information.
  • \(P(B)\): Evidence or Marginal Likelihood
    • Interpretation: The probability of observing the data \(B\) under all possible hypotheses. It serves as a normalizing constant that makes \(P(A|B)\) a probability distribution.
    • Calculation: \(P(B) = \sum_{A'} P(B|A')P(A')\) (discrete case); \(P(B) = \int P(B|A)p(A)\, dA\) (continuous case, where \(p(A)\) is the probability density function)
    • Model Comparison: Used to compare the evidence for different models, for example when calculating Bayes factors.

3. Bayes’ Theorem and Bayesian Inference

  • Core: Bayes’ theorem is the core principle of Bayesian inference, which infers the probability distribution of a parameter or hypothesis given data.
  • Process:
    1. Prior: Set the prior distribution \(p(\theta)\) for the parameter \(\theta\).
    2. Likelihood: Define the likelihood function \(p(x|\theta)\), which is the probability of observing data \(x\) given the parameter \(\theta\).
    3. Posterior: Calculate the posterior distribution \(p(\theta|x)\) using Bayes’ theorem. \(p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)} = \frac{p(x|\theta)p(\theta)}{\int p(x|\theta')p(\theta') d\theta'}\)
    4. Inference: Perform estimation, interval estimation, and hypothesis testing based on the posterior distribution.
  • Iterative Update: It is possible to update beliefs continuously by using the previous posterior distribution as the new prior distribution when new data arrives. (Sequential Bayesian updating)
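
The sketch below illustrates sequential Bayesian updating with the beta–Bernoulli conjugate pair (discussed in the next subsection). The prior, the true success probability, and the sample size are all hypothetical choices for illustration:

Code
import numpy as np

rng = np.random.default_rng(0)

# Prior Beta(a, b) on the success probability of a Bernoulli variable
a, b = 1.0, 1.0            # Beta(1, 1) is the uniform (non-informative) prior
true_p = 0.7               # hypothetical true success probability

# Each observation turns the current posterior into the next prior:
# a success increments a, a failure increments b.
for flip in rng.binomial(1, true_p, size=100):
    a += flip
    b += 1 - flip

print(f"posterior mean = {a / (a + b):.3f}")   # concentrates near 0.7 as data accumulate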

4. Extensions and Applications of Bayes’ Theorem

  • Continuous Random Variables: Bayes’ theorem using probability density functions
  • Conjugate Prior:
    • A prior distribution that makes the posterior distribution belong to the same family of distributions as the prior. It is widely used due to its computational convenience. (e.g., beta distribution - Bernoulli distribution, gamma distribution - Poisson distribution)
  • Variational Bayes:
    • A method for approximating complex posterior distributions.
    • Find a manageable distribution similar to the posterior distribution and minimize the Kullback-Leibler divergence between the two distributions.
  • Markov Chain Monte Carlo (MCMC):
    • A method for estimating the characteristics of a posterior distribution by sampling from it.
    • Algorithms such as Metropolis-Hastings and Gibbs sampling.
  • Applications in Deep Learning:
    • Bayesian Neural Networks: Treat neural network weights as random variables to quantify the uncertainty of predictions.
    • Gaussian Processes: Define a prior distribution on the function space using kernels and calculate the predictive distribution using Bayes’ theorem.

Maximum Likelihood Estimation (MLE) - In-Depth Analysis and Comparison with MAP

1. Specific Example of MLE Calculation

MLE is a method for finding the parameter that best explains the given data. It finds the parameter value that maximizes the likelihood of the observed data.

  • Likelihood Function:

    • When the data \(x_1, x_2, ..., x_n\) are assumed to be independently and identically distributed (i.i.d), the likelihood function is defined as follows. \[L(\theta; x_1, ..., x_n) = \prod_{i=1}^{n} p(x_i | \theta)\]
      • \(\theta\): parameter
      • \(p(x_i | \theta)\): probability (or probability density) of data \(x_i\) given parameter \(\theta\)
  • Log-Likelihood Function:

    • For convenience of calculation, the log-likelihood function is used, which is the logarithm of the likelihood function. \[l(\theta; x_1, ..., x_n) = \log L(\theta; x_1, ..., x_n) = \sum_{i=1}^{n} \log p(x_i | \theta)\]
    • Since taking the logarithm does not change the location of the maximum, we can also find the parameter that maximizes the log-likelihood.
  • MLE Calculation Procedure:

    1. Define the likelihood function for the given data and probability distribution model.
    2. Take the logarithm of the likelihood function to obtain the log-likelihood function.
    3. Differentiate the log-likelihood function with respect to parameter \(\theta\).
    4. Find the value of \(\theta\) where the derivative is 0 (if necessary, use the second derivative to determine whether it is a maximum or minimum).
    5. The found \(\theta\) value is the MLE estimate.
  • Specific Example:

    • Normal Distribution:
      • Assume that the data \(x_1, ..., x_n\) follow a normal distribution with mean \(\mu\) and variance \(\sigma^2\).
      • Log-likelihood function: \[l(\mu, \sigma^2; x_1, ..., x_n) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\]
      • By partially differentiating with respect to \(\mu\) and \(\sigma^2\) and finding the point where it becomes 0, the MLE estimates are as follows.
        • \(\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} x_i\) (sample mean)
        • \(\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu}_{MLE})^2\) (sample variance)
    • Bernoulli Distribution:
      • Assume that the data \(x_1, ..., x_n\) follow a Bernoulli distribution with success probability \(p\). (\(x_i = 1\) (success), \(x_i = 0\) (failure))
      • Log-likelihood function: \[ l(p; x_1, ..., x_n) = \sum_{i=1}^n [x_i \log p + (1-x_i) \log (1-p)] \]
      • By differentiating with respect to \(p\) and finding the point where it becomes 0, the MLE estimate is as follows.
        • \(\hat{p}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} x_i\) (number of successes / total number of trials)
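
The closed-form estimates above are easy to check numerically. A minimal sketch with synthetic data (the distribution parameters and sample sizes are hypothetical):

Code
import numpy as np

rng = np.random.default_rng(42)

# Normal distribution: the MLE of mu is the sample mean, and the MLE of
# sigma^2 is the biased sample variance (dividing by n, not n - 1).
data = rng.normal(loc=2.0, scale=1.5, size=10_000)
mu_mle = data.mean()
sigma2_mle = np.mean((data - mu_mle) ** 2)
print(f"mu_MLE = {mu_mle:.3f}, sigma^2_MLE = {sigma2_mle:.3f}")  # near 2.0 and 2.25

# Bernoulli distribution: the MLE of p is the fraction of successes.
flips = rng.binomial(n=1, p=0.3, size=10_000)
print(f"p_MLE = {flips.mean():.3f}")                              # near 0.3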

2. Advantages and Disadvantages of MLE

  • Advantages:
    • Ease of calculation: Parameter estimation is possible with relatively simple calculations. (Especially for exponential family distributions)
    • Asymptotic properties: (Described in more detail below)
      • Consistency: As the sample size increases, the MLE estimator converges to the actual parameter.
      • Asymptotic normality: As the sample size increases, the MLE estimator approaches a normal distribution.
      • Efficiency: Asymptotically it attains the Cramér–Rao lower bound, i.e., the smallest variance achievable by an unbiased estimator.
  • Disadvantages:
    • Overfitting possibility: Especially when the sample size is small, it may be overfitted to the data and have poor generalization performance.
    • Sensitive to outliers: If there are outliers, the MLE estimate can be greatly distorted.
    • Not applicable to all distributions: MLE requires a probabilistic model to be given (not applicable to non-parametric methods).
    • Possibility of bias: In some cases, the MLE estimator may be biased (e.g., variance estimation of a normal distribution).

3. Comparison with Maximum A Posteriori (MAP) Estimation

  • MAP: Based on Bayes’ theorem, it finds the parameter that maximizes the posterior probability by combining the prior probability and likelihood.

  • MAP estimation: \[ \hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta|x) = \arg\max_{\theta} \frac{p(x|\theta)p(\theta)}{p(x)} = \arg\max_{\theta} p(x|\theta)p(\theta) \]

    • \(p(\theta|x)\): Posterior probability
    • \(p(x|\theta)\): Likelihood
    • \(p(\theta)\): Prior probability
    • \(p(x)\): Evidence (constant, so can be ignored)
  • MLE vs. MAP:

| Feature | MLE | MAP |
|---|---|---|
| Basis | Frequentist | Bayesian |
| Goal | Maximum likelihood | Maximum a posteriori probability |
| Prior Probability | Not considered | Considered |
| Result | Point estimate | Point estimate (generally) or distribution estimate (in Bayesian inference) |
| Overfitting | High risk of overfitting | Can prevent overfitting through the prior (regularization effect) |
| Computational Complexity | Generally low | May increase depending on the prior (especially with non-conjugate priors) |

  • Influence of Prior Probability:

    • Non-informative Prior Distribution: When the prior probability follows a uniform distribution, such as \(p(\theta) \propto 1\) (constant), MAP estimation is equivalent to MLE estimation.
    • Informative Prior Distribution: When the prior probability follows a specific distribution (e.g., normal distribution, beta distribution), MAP estimation is affected by the prior probability and differs from MLE estimation. The stronger the prior belief represented by the prior distribution, the closer the posterior is to the prior.

4. Asymptotic Property of MLE

  • Consistency:
    • As the sample size \(n\) approaches infinity, the MLE estimator \(\hat{\theta}_{MLE}\) converges in probability to the true parameter \(\theta_0\). \[\hat{\theta}_{MLE} \xrightarrow{p} \theta_0 \text{ as } n \rightarrow \infty\]
  • Asymptotic Normality:
    • When the sample size \(n\) is sufficiently large, the distribution of the MLE estimator \(\hat{\theta}_{MLE}\) approximates the following normal distribution. \[\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})\]
      • \(I(\theta_0)\): Fisher Information Matrix (FIM)
        • \(I(\theta) = -E[\frac{\partial^2}{\partial \theta^2} l(\theta; x_1, ...,x_n)]\) (in the case of a single parameter)
        • FIM represents the curvature of the log-likelihood function and means the amount of information about the parameter.
  • Efficiency:
    • MLE is an efficient estimator that asymptotically achieves the Cramér–Rao lower bound (CRLB). It has the smallest variance asymptotically compared to other unbiased estimators.

2.3.3 Information Theory Basics

Challenge: How can we measure the amount of information and quantify uncertainty?

Researcher’s Dilemma: Claude Shannon faced fundamental questions about the efficient transmission and compression of information in communication systems. He needed a theoretical basis for quantifying information, determining how much data could be compressed without losing information, and how much information could be reliably transmitted through noisy channels.

Information theory is a mathematical theory concerning the compression, transmission, and storage of data, playing a crucial role in evaluating and optimizing model performance in deep learning. In this section, we will explore the core concepts of information theory: entropy, mutual information, and KL divergence.

Entropy

Entropy is a measure of the uncertainty of information. The entropy \(H(P)\) of a probability distribution \(P\) is defined as:

\[H(P) = -\sum_{x} P(x) \log P(x)\]

where \(x\) represents all possible events. The main properties of entropy are:

  1. Non-negativity: \(H(P) ≥ 0\)
  2. Maximum for uniform distribution: Entropy is maximum when all events have equal probabilities.
  3. Zero entropy for certain events: \(H(P) = 0\) when \(P(x) = 1\)

In deep learning, entropy is primarily used as the basis for cross-entropy, a loss function commonly used in classification problems. The following example calculates the entropy of various probability distributions and visualizes the entropy of a binary distribution.

Code
from dldna.chapter_02.information_theory import calculate_entropy
calculate_entropy()
Entropy of fair coin: 0.69
Entropy of biased coin: 0.33
Entropy of fair die: 1.39

Mutual Information

Mutual information measures the mutual dependence between two random variables \(X\) and \(Y\). It is defined mathematically as follows.

\[I(X;Y) = \sum_{x}\sum_{y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}\]

The main characteristics of Mutual Information are as follows.

  1. Non-negativity: \(I(X;Y) \ge 0\)
  2. Symmetry: \(I(X;Y) = I(Y;X)\)
  3. Zero when X and Y are independent: If X and Y are independent, \(I(X;Y) = 0\)

Mutual Information is used in various machine learning tasks such as feature selection and dimension reduction. The following example calculates and visualizes the Mutual Information for a simple joint probability distribution.

Code
from dldna.chapter_02.information_theory import mutual_information_example
mutual_information_example()
Mutual Information: 0.0058
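
For reference, mutual information can also be computed directly from a joint distribution table. A minimal sketch with a hypothetical \(2 \times 2\) joint distribution (independent of the dldna example above):

Code
import numpy as np

# Hypothetical 2x2 joint distribution P(x, y)
P_xy = np.array([[0.30, 0.20],
                 [0.15, 0.35]])

P_x = P_xy.sum(axis=1, keepdims=True)   # marginal P(x)
P_y = P_xy.sum(axis=0, keepdims=True)   # marginal P(y)

# I(X;Y) = sum over x, y of P(x,y) log[ P(x,y) / (P(x)P(y)) ]
I = np.sum(P_xy * np.log(P_xy / (P_x * P_y)))
print(f"I(X;Y) = {I:.4f}")              # positive here; 0 iff X and Y are independent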

KL Divergence

KL (Kullback-Leibler) divergence is a method for measuring the difference between two probability distributions P and Q. The KL divergence of Q from P is defined as follows.

\[D_{KL}(P||Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}\]

The main properties of KL divergence are as follows.

  1. Non-negativity: \(D_{KL}(P||Q) \ge 0\)
  2. Zero if and only if P = Q: \(D_{KL}(P||Q) = 0\) if and only if \(P = Q\)
  3. Asymmetry: In general, \(D_{KL}(P||Q) \ne D_{KL}(Q||P)\)

KL divergence is used in deep learning as follows.

  1. Variational inference: It is used to minimize the difference between an approximate distribution and the actual distribution.
  2. Model compression: It is used for knowledge distillation in teacher-student networks.
  3. Anomaly detection: It is used to measure the difference from a normal data distribution.

The concepts of information theory are closely related to each other. For example, mutual information can be expressed as the difference between entropy and conditional entropy.

\(I(X;Y) = H(X) - H(X|Y)\)

Additionally, KL divergence can be represented as the difference between cross-entropy and entropy.

\(D_{KL}(P||Q) = H(P,Q) - H(P)\)

Here, \(H(P,Q)\) is the cross-entropy of \(P\) and \(Q\). The following calculates the KL divergence between two probability distributions and visualizes the distribution.

Code
from dldna.chapter_02.information_theory import kl_divergence_example
kl_divergence_example()
KL(P||Q): 0.0823
KL(Q||P): 0.0872

These concepts of information theory are widely applied to the design and optimization of deep learning models. For example, they are used in various ways, such as using a combination of reconstruction error and KL divergence as the loss function for autoencoders, or using KL divergence as a constraint for policy optimization in reinforcement learning.

In the next chapter, we will look at how these concepts of probability, statistics, and information theory are applied in actual deep learning models.

Core Concepts of Information Theory - Information Content, Cross Entropy, KL-Divergence, Mutual Information

1. Information Content (Self-information)

  • Definition: The information content (self-information) represents the amount of information that can be obtained when a specific event occurs. Events that occur less frequently have higher information content.

  • Formula: \[I(x) = -\log(P(x))\]

    • \(x\): Event
    • \(P(x)\): Probability of event \(x\) occurring
    • \(\log\): The base of the logarithm can be 2 (unit: bits), \(e\) (unit: nats), or 10, among others. In deep learning, the natural logarithm (\(e\)) is commonly used.
  • Intuitive Explanation:

    • Rarity: Events with lower probabilities (rare events) have higher information content. For example, “The sun rises in the east” is a certain fact and has almost no information content, while “I won the lottery today” is a very rare event and thus has high information content.
    • Reduction of Uncertainty: Information content can be interpreted as a measure of how much uncertainty is reduced after an event occurs.
  • Properties:

    • Since \(0 \le P(x) \le 1\), we have \(I(x) \ge 0\).
    • If \(P(x) = 1\) (a certain event), then \(I(x) = 0\).
    • The smaller \(P(x)\) is, the larger \(I(x)\) becomes.
    • For two independent events \(x\) and \(y\), we have \(I(x, y) = I(x) + I(y)\) (additivity of information content).

2. Cross Entropy

  • Definition: Cross entropy measures how different two probability distributions \(P\) and \(Q\) are. With \(P\) as the true distribution and \(Q\) as the estimated distribution, it represents the average number of bits (when using log base 2) needed to encode samples from \(P\) with a code optimized for \(Q\).

  • Derivation:

    1. Information Content: The information content of an event \(x\) following the true distribution \(P\): \(I(x) = -\log P(x)\)
    2. Average Information (Entropy): The average information (entropy) of the true distribution \(P\): \(H(P) = -\sum_{x} P(x) \log P(x)\)
    3. Using Estimated Distribution: When using the estimated distribution \(Q\) to represent the true distribution \(P\), the information content for each event \(x\): \(-\log Q(x)\)
    4. Cross Entropy: The average information when representing the true distribution \(P\) using the estimated distribution \(Q\): \[H(P, Q) = -\sum_{x} P(x) \log Q(x)\]
  • Intuitive Explanation:

    • The more similar \(P\) and \(Q\) are, the smaller the cross entropy becomes.
    • When \(P = Q\), the cross entropy reaches its minimum value (the entropy \(H(P)\)).
    • The more different \(P\) and \(Q\) are, the larger the cross entropy becomes. This means that as the estimated distribution fails to reflect the actual distribution, information loss occurs.
  • Binary Cross Entropy (BCE):

    • Used for binary classification problems with two classes (0 or 1).
    • \(P = [p, 1-p]\) (true class probability distribution, where \(p\) is the probability of class 1)
    • \(Q = [q, 1-q]\) (predicted class probability distribution, where \(q\) is the predicted probability of class 1)
    • \[H(P, Q) = -[p \log q + (1-p) \log (1-q)]\]
  • Categorical Cross Entropy (CCE):

    • Used in multi-class classification problems with multiple classes.
    • \(P = [p_1, p_2, ..., p_k]\) (actual class probability distribution, \(p_i\) is the probability of the \(i\)th class, one-hot encoding)
    • \(Q = [q_1, q_2, ..., q_k]\) (predicted class probability distribution, \(q_i\) is the probability of predicting the \(i\)th class, softmax)
    • \[H(P, Q) = -\sum_{i=1}^{k} p_i \log q_i\]
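
A minimal NumPy sketch of both formulas (the clipping constant eps is an implementation detail added here to avoid \(\log 0\), not part of the definitions):

Code
import numpy as np

def binary_cross_entropy(p, q, eps=1e-12):
    """H(P, Q) for true probability p and predicted probability q of class 1."""
    q = np.clip(q, eps, 1 - eps)   # avoid log(0)
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

def categorical_cross_entropy(p, q, eps=1e-12):
    """H(P, Q) for a one-hot target p and a predicted distribution q."""
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

print(binary_cross_entropy(1.0, 0.9))   # confident and correct: small loss
print(binary_cross_entropy(1.0, 0.1))   # confident and wrong: large loss
print(categorical_cross_entropy(np.array([0.0, 1.0, 0.0]),
                                np.array([0.2, 0.7, 0.1])))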

3. Cross Entropy and Likelihood

  • Likelihood: The probability that given data occurs in a specific model (parameter).
  • Negative Log-Likelihood (NLL): The value obtained by taking the logarithm of the likelihood and multiplying it by -1.
  • Relationship between Cross Entropy and NLL:
    • In classification problems, when the model’s output (predicted probability distribution) is \(Q\) and the actual label (one-hot encoding) is \(P\), Cross Entropy is equal to Negative Log-Likelihood.
    • Minimizing Cross Entropy is equivalent to maximizing Likelihood (Maximum Likelihood Estimation, MLE).
  • Application in Deep Learning:
    • Using Cross Entropy as a loss function for classification problems in deep learning is equivalent to training the model so that its output follows the actual label distribution (from the perspective of MLE).

4. Relationship between KL-Divergence and Cross Entropy

  • KL-Divergence (Kullback-Leibler Divergence):

    • Another method for measuring the difference between two probability distributions \(P\) and \(Q\) (not a distance concept, asymmetric).
    • The KL-Divergence from \(P\) to \(Q\) represents the additional amount of information required to express \(P\) using \(Q\).
    • \[D_{KL}(P||Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = \sum_{x} P(x)[\log P(x) - \log Q(x)]\]
  • Relationship between KL-Divergence and Cross Entropy:

    \[D_{KL}(P||Q) = \sum_{x} P(x) \log P(x) - \sum_{x} P(x) \log Q(x) = -\sum_{x} P(x) \log Q(x) - (-\sum_{x} P(x) \log P(x))\] \[D_{KL}(P||Q) = H(P, Q) - H(P)\]

    • \(H(P,Q)\): Cross Entropy

    • \(H(P)\): Entropy

    • KL-Divergence is the value obtained by subtracting the entropy of \(P\) from the Cross Entropy.

    • When \(P\) is fixed, minimizing Cross Entropy is equivalent to minimizing KL-Divergence.
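
This identity is easy to confirm numerically; a small sketch with two hypothetical discrete distributions:

Code
import numpy as np

P = np.array([0.1, 0.4, 0.5])   # hypothetical "true" distribution
Q = np.array([0.3, 0.3, 0.4])   # hypothetical "estimated" distribution

H_P  = -np.sum(P * np.log(P))          # entropy H(P)
H_PQ = -np.sum(P * np.log(Q))          # cross entropy H(P, Q)
D_KL = np.sum(P * np.log(P / Q))       # KL divergence D_KL(P||Q)

print(D_KL, H_PQ - H_P)                # the two values coincide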

5. Relationship between Mutual Information and Conditional Entropy

  • Mutual Information:

    • A measure of how much information two random variables \(X\) and \(Y\) share with each other.
    • Represents the amount by which the uncertainty of \(Y\) decreases when \(X\) is known (or vice versa).
    • \[I(X;Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}\]
    • \(P(x,y)\): Joint Probability Distribution
    • \(P(x)\), \(P(y)\): Marginal Probability Distribution
  • Conditional Entropy:

    • Represents the uncertainty of a random variable \(X\) given a random variable \(Y\). \[H(X|Y) = -\sum_{y} P(y) \sum_{x} P(x|y) \log P(x|y) = -\sum_{x,y} P(x,y) \log P(x|y)\]
  • Relationship between Mutual Information and Conditional Entropy: \[I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\]

    • The mutual information of \(X\) and \(Y\) is equal to the entropy of \(X\) minus the conditional entropy of \(X\) given \(Y\).
    • Represents the degree to which the uncertainty of \(X\) decreases when \(Y\) is known.

6. Jensen–Shannon Divergence

  • Jensen–Shannon Divergence (JSD):
    • Another way to measure the distance between two probability distributions \(P\) and \(Q\). Unlike KL-Divergence, it is symmetric and bounded (between 0 and 1).
    • \[JSD(P||Q) = \frac{1}{2}D_{KL}(P||M) + \frac{1}{2}D_{KL}(Q||M)\]
      • \(M = \frac{1}{2}(P + Q)\): the average distribution of \(P\) and \(Q\)
  • Characteristics:
    • Symmetry: \(JSD(P||Q) = JSD(Q||P)\)
    • Boundedness: \(0 \le JSD(P||Q) \le 1\) (when using log base 2)
    • The square root of JSD satisfies the conditions of a distance function (metric).
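
A minimal sketch of JSD (log base 2, so the value is bounded by 1; the two distributions are hypothetical):

Code
import numpy as np

def kl(p, q):
    # KL divergence with log base 2 (bits); assumes strictly positive entries
    return np.sum(p * np.log2(p / q))

def jsd(p, q):
    m = 0.5 * (p + q)                  # mixture distribution M
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = np.array([0.1, 0.4, 0.5])
Q = np.array([0.3, 0.3, 0.4])
print(jsd(P, Q), jsd(Q, P))            # symmetric, and between 0 and 1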

2.3.4 Loss Function

The loss function measures how different the prediction of a machine learning model is from the actual value. The goal of model training is to find the parameters (weights and biases) that minimize the value of this loss function. Choosing an appropriate loss function has a significant impact on the performance of the model, so it should be selected carefully based on the type of problem and the characteristics of the data.

Definition of Loss Function

In general, the loss function \(L\) can be expressed as follows when the model’s parameters are \(\theta\) and the data point is \((x_i, y_i)\) (where \(y_i\) is the actual value and \(f(x_i; \theta)\) is the model’s predicted value):

\(L(\theta) = \frac{1}{N} \sum_{i=1}^{N} l(y_i, f(x_i; \theta))\)

\(N\) is the number of data points, and \(l\) is a function representing the loss for each individual data point.

Main Loss Functions

The following are loss functions commonly used in machine learning and deep learning:

1. Mean Squared Error (MSE)
  • Formula: \(MSE = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2\) (\(y_i\): actual value, \(\hat{y}_i\): predicted value)
  • Characteristics:
    • Sensitive to outliers because the error is squared.
    • Differentiable and a convex function, making it easy to find the optimal solution using gradient descent.
  • Usage: Mainly used for regression problems.
2. Mean Absolute Error (MAE)
  • Formula: \(MAE = \frac{1}{N} \sum_{i=1}^N |y_i - \hat{y}_i|\)
  • Characteristics:
    • Less sensitive to outliers compared to MSE.
    • Not differentiable where the error is zero (\(y_i = \hat{y}_i\)), but this is handled in practice by subgradients or automatic differentiation in deep learning frameworks.
  • Usage: Used for regression problems.
3. Cross-Entropy Loss
  • Formula:
    • Binary Classification: \(L = -\frac{1}{N} \sum_{i=1}^N [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]\)
    • Multi-class Classification: \(L = -\frac{1}{N} \sum_{i=1}^N \sum_{j=1}^C y_{ij} \log(\hat{y}_{ij})\) (\(C\): number of classes)
  • Characteristics:
    • Measures the difference between the predicted probability distribution and the actual distribution.
    • Tends to converge faster than MSE in classification problems.
    • Used with the softmax activation function in the output layer.
  • Usage: Classification problems (binary classification, multi-class classification).
4. Hinge Loss
  • Formula: \(L = \max(0, 1 - y \cdot f(x))\) (\(y\): actual class \(\in\) {-1, 1}, \(f(x)\): model’s predicted value)
  • Characteristics:
    • Maximizes the margin between “correct” and “incorrect” classes.
    • Not differentiable at the margin boundary \(y \cdot f(x) = 1\).
  • Usage: Mainly used for binary classification problems, such as in Support Vector Machines (SVM).
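
Minimal NumPy sketches of these four losses (averaging over the batch is an assumption made here; frameworks offer other reductions):

Code
import numpy as np

def mse(y, y_pred):
    return np.mean((y - y_pred) ** 2)

def mae(y, y_pred):
    return np.mean(np.abs(y - y_pred))

def binary_cross_entropy(y, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

def hinge(y, f_x):
    # y in {-1, +1}; f_x is the raw model score
    return np.mean(np.maximum(0.0, 1.0 - y * f_x))

y_true = np.array([1.0, 0.0, 1.0])
y_prob = np.array([0.8, 0.2, 0.6])
print(mse(y_true, y_prob), mae(y_true, y_prob), binary_cross_entropy(y_true, y_prob))
print(hinge(np.array([1.0, -1.0, 1.0]), np.array([0.9, -0.4, 1.3])))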

Criteria for Selecting a Loss Function

  • Problem Type: The appropriate loss function differs depending on whether it’s a regression or classification problem.
  • Data Characteristics: The presence of outliers, class imbalance, and other factors should be considered when selecting a loss function.
  • Model: The choice of loss function may depend on the specific model being used.

Additional Loss Functions

  • Kullback-Leibler Divergence (KLD): Measures the difference between two probability distributions P and Q. Mainly used in generative models such as Variational Autoencoders (VAEs).
  • Focal Loss: A loss function that adjusts Cross-Entropy to work well with imbalanced data, primarily used in object detection tasks.
  • Huber Loss: Combines MSE and MAE, providing robustness to outliers while being differentiable.
  • Log-Cosh Loss: Similar to Huber Loss but has the advantage of being twice differentiable at all points.
  • Contrastive Loss: Used in Siamese Networks to learn embeddings where similar sample pairs are close and dissimilar pairs are far apart.
  • Triplet Loss: Uses three samples (Anchor, Positive, Negative) to learn embeddings such that the distance between Anchor and Positive is minimized, and the distance between Anchor and Negative is maximized.
  • CTC Loss: A loss function used in speech recognition, handwriting recognition, etc., when the input sequence and output sequence have different lengths.

In-Depth Analysis of Loss Functions

Loss Function and Maximum Likelihood Estimation (MLE)

The learning process of many machine learning models can be explained from the perspective of maximum likelihood estimation (MLE). MLE is a method for finding the model parameters that best explain the given data. Assuming that the data are independent and identically distributed (i.i.d), the likelihood function is defined as follows:

\(L(\theta) = P(D|\theta) = \prod_{i=1}^{N} P(y_i | x_i; \theta)\)

where \(D = \{(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)\}\) is the training data and \(\theta\) is the model parameter. \(P(y_i | x_i; \theta)\) is the probability (or probability density) that the model outputs \(y_i\) when given \(x_i\) as input.

The goal of MLE is to find the parameter \(\theta\) that maximizes the likelihood function \(L(\theta)\). In practice, it is more convenient to maximize the log-likelihood function:

\(\log L(\theta) = \sum_{i=1}^{N} \log P(y_i | x_i; \theta)\)

  • MSE and MLE: In linear regression models, if the error follows a normal distribution with a mean of 0 and a variance of \(\sigma^2\), MLE is equivalent to minimizing MSE.

    \(P(y_i | x_i; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - f(x_i; \theta))^2}{2\sigma^2}\right)\)

    The log-likelihood function is: \(\log L(\theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - f(x_i;\theta))^2\)

    Except for constants, and assuming \(\sigma^2\) is constant, maximizing the log-likelihood function is equivalent to minimizing MSE.

  • Cross-Entropy and MLE: In classification problems, the output \(\hat{y}_i\) can be interpreted as a parameter of the Bernoulli distribution (binary classification) or the multinomial distribution (multi-class classification). In this case, MLE is equivalent to minimizing Cross-Entropy Loss.

    • Binary Classification (Bernoulli Distribution): If \(\hat{y_i}\) is the probability that \(y_i=1\) as predicted by the model, \(P(y_i|x_i;\theta) = \hat{y_i}^{y_i} (1 - \hat{y_i})^{(1-y_i)}\) Log-likelihood: \(\log L(\theta) = \sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)]\)

    • Multi-Class Classification (Categorical/Multinoulli Distribution): \(P(y_i | x_i; \theta) = \prod_{j=1}^{C} \hat{y}_{ij}^{y_{ij}}\) (one-hot encoding) Log-likelihood: \(\log L(\theta) = \sum_{i=1}^N \sum_{j=1}^C y_{ij} \log(\hat{y}_{ij})\)

    Therefore, minimizing Cross-Entropy Loss is the same process as finding the parameters that best model the data distribution using MLE.
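
The MSE–MLE correspondence can also be checked numerically: for fixed \(\sigma^2\), the Gaussian log-likelihood and \(-\)MSE differ only by constants, so they rank parameter settings identically. The data and one-parameter model below are hypothetical:

Code
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear data: y = 3x + Gaussian noise
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.5, size=100)

def mse(theta):
    return np.mean((y - theta * x) ** 2)

def gaussian_log_likelihood(theta, sigma2=0.25):
    n = len(y)
    return (-n / 2 * np.log(2 * np.pi * sigma2)
            - np.sum((y - theta * x) ** 2) / (2 * sigma2))

thetas = np.linspace(2.0, 4.0, 201)
best_by_mse = thetas[np.argmin([mse(t) for t in thetas])]
best_by_ll  = thetas[np.argmax([gaussian_log_likelihood(t) for t in thetas])]
print(best_by_mse, best_by_ll)   # the same theta wins under both criteria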

Additional Loss Functions (KLD, Focal Loss)

  • Kullback-Leibler Divergence (KLD):

  • Description: Measures the difference between two probability distributions P and Q. P represents the actual data distribution, and Q represents the estimated distribution by the model.

  • Formula: \(D_{KL}(P||Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}\)

  • Characteristics:

    • Asymmetric: \(D_{KL}(P||Q) \neq D_{KL}(Q||P)\)
    • Always non-negative: \(D_{KL}(P||Q) \ge 0\), with \(D_{KL}(P||Q) = 0\) only when \(P=Q\)
    • Terms with \(P(x) = 0\) contribute 0; the divergence is infinite (undefined in the usual sense) wherever \(Q(x) = 0\) but \(P(x) > 0\)
  • Relationship to VAE:

    • In Variational Autoencoders (VAE), KL Divergence is used to make the posterior distribution of latent variables closer to a prior distribution, such as a normal distribution.
    • The loss function of VAE consists of reconstruction loss and KL Divergence terms.
  • Focal Loss:

    • Description: Proposed to address class imbalance problems by modifying Cross-Entropy Loss, particularly the imbalance between “easy” and “hard” examples.
    • Formula: \(FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)\)
      • \(p_t\): the model's predicted probability for the true class
      • \(\gamma\): focusing parameter (\(\gamma \ge 0\), typically 2)
      • \(\alpha_t\): class-wise weights (optional)
    • Characteristics:
      • When \(\gamma = 0\), it becomes the standard Cross-Entropy Loss.
      • When \(\gamma > 0\), it reduces the loss for well-classified samples (\(p_t\) is large) and relatively keeps the loss large for poorly classified samples (\(p_t\) is small), focusing more on hard examples.
      • Class-wise weights can be adjusted using \(\alpha_t\) (e.g., giving more weight to classes with fewer instances).
    • Application in Object Detection:
      • Object detection problems suffer from severe class imbalance because background (negative) regions far outnumber object (positive) regions.
      • Focal Loss mitigates this imbalance, guiding the detector to focus on actual objects rather than the background.
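
A minimal binary-classification sketch of focal loss following the formula above (\(\gamma = 2\) and \(\alpha = 0.25\) are common defaults; the probabilities are hypothetical):

Code
import numpy as np

def focal_loss(y, p, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss; y in {0, 1}, p = predicted probability of class 1."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-wise weight
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

y = np.array([1, 1, 0, 0])
p = np.array([0.95, 0.60, 0.05, 0.40])   # easy/hard positives and negatives
print(focal_loss(y, p))                  # hard examples dominate the loss
print(focal_loss(y, p, gamma=0.0, alpha=0.5))   # reduces to 0.5 x standard BCE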

Various Loss Functions (Advanced)

  • Huber Loss: A loss function that combines the advantages of MSE and MAE. For errors smaller than a threshold \(\delta\), it uses squared error like MSE; for larger errors, it uses absolute error like MAE. It is robust to outliers and differentiable everywhere. (A short sketch of Huber and Log-Cosh Loss follows this list.)

    \(L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\ \delta(|y - \hat{y}| - \frac{1}{2}\delta) & \text{otherwise} \end{cases}\)

  • Log-Cosh Loss: Defined as \(\log(\cosh(y - \hat{y}))\). Similar to Huber Loss, it is robust to outliers and has the advantage of being twice differentiable at all points.

  • Quantile Loss: Used to minimize the prediction error at a specific quantile.

  • Contrastive Loss, Triplet Loss: These are used in Siamese Network, Triplet Network, etc., and are used to adjust the distance between similar sample pairs/triplets. (For details, refer to relevant papers)

  • Connectionist Temporal Classification (CTC) Loss: This is used when the alignment between the input sequence and output sequence is not clear, such as in speech recognition and handwriting recognition.
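
As referenced above, a minimal sketch of Huber and Log-Cosh losses (the threshold \(\delta\) and the test values are hypothetical):

Code
import numpy as np

def huber(y, y_pred, delta=1.0):
    err = np.abs(y - y_pred)
    quadratic = 0.5 * err ** 2               # MSE-like region, |err| <= delta
    linear = delta * (err - 0.5 * delta)     # MAE-like region, |err| > delta
    return np.mean(np.where(err <= delta, quadratic, linear))

def log_cosh(y, y_pred):
    return np.mean(np.log(np.cosh(y - y_pred)))   # smooth everywhere

y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.5, 1.5, 10.0])          # the last value acts as an outlier
print(huber(y_true, y_pred), log_cosh(y_true, y_pred))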

Loss Function Selection Guidelines (Advanced)

  • Outlier handling: If there are many outliers and robustness to outliers is required, MAE, Huber Loss, Quantile Loss, etc. can be considered.
  • Differentiability: For gradient-based optimization, a differentiable loss function is needed. However, even for non-differentiable cases like Hinge Loss and MAE, subgradients (subdifferentials) can be used or automatic differentiation in deep learning frameworks can be utilized.
  • Probabilistic modeling: If the model output is to be interpreted as a probability distribution, Cross-Entropy Loss is suitable.
  • Class imbalance: In cases of severe class imbalance, Focal Loss, Weighted Cross-Entropy, etc. can be considered.
  • Multiple outputs: When there are multiple outputs and correlations exist between them, loss functions for each output can be combined.

The loss function is one of the key factors determining the performance of a deep learning model. The ability to select an appropriate loss function considering the problem characteristics, data distribution, and model structure, and to design new loss functions if necessary, is required for deep learning engineers.

Designing New Loss Functions

Existing loss functions (MSE, Cross-Entropy, etc.) are not always the optimal choice. Depending on the specific requirements of the problem, data distribution, and model architecture, it may be necessary to design new loss functions. Designing new loss functions is an important part of deep learning research and has the potential to significantly improve model performance.

When New Loss Functions Are Needed

  • Special structure of data: When the data does not follow a common distribution (Gaussian, Bernoulli, etc.) or has a special structure (e.g., ranking, sparsity, hierarchical, graph structure).
  • Specific constraints of the problem: When specific constraints (e.g., monotonicity, sparsity, fairness, robustness) are desired to be imposed on the model’s predictions.
  • Limitations of existing loss functions: When existing loss functions do not work well for a particular problem (e.g., sensitive to outliers, class imbalance) or do not sufficiently reflect the desired goals. When specific metrics need to be directly optimized.
  • Multi-objective optimization: When multiple loss functions need to be combined and optimized simultaneously (e.g., balancing prediction accuracy and model complexity).
  • Generative models: Generative models aim to learn the distribution of data, so they require different loss functions than typical classification or regression problems.

Principles for Designing New Loss Functions

When designing new loss functions, the following principles should be considered:

  1. Problem definition and objectives: Clearly define the problem to be solved and the ultimate goals of the model. The loss function is a key element in defining what the model should learn. (e.g., simply increasing classification accuracy, doing better on specific classes, adjusting False Positive/False Negative ratios, etc.)
  2. Mathematical validity:
    • Differentiability: For gradient-based optimization, the loss function should be differentiable at almost all points. If there are non-differentiable points, it should be possible to use subgradients (subdifferentials).
    • Convexity: If the loss function is convex, it guarantees finding the global minimum. Even for non-convex functions, the design should aim to find good local minima.
    • Preventing Gradient Vanishing/Exploding: Very large or very small gradients make learning unstable. Care should be taken to avoid situations where gradients become 0 or very small, such as the “dying ReLU” problem or vanishing gradients in sigmoid/tanh.
    • Scale Invariance: The loss function’s value should not change significantly with the scale of input data or parameters.
  3. Interpretability: If the meaning of the loss function can be intuitively understood, it helps analyze and debug the model’s learning process. Each term should have a clear role and meaning, including the interpretation of hyperparameters and their influence.
  4. Computational Efficiency: Since the loss function is computed at each iteration and for all (or mini-batch) data points, high computational costs can slow down training.
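
As a concrete check of the differentiability point (item 2 above), the following sketch uses PyTorch autograd on \(|x|\), which has a kink at \(x = 0\); autograd returns a valid subgradient (here 0) at the kink, so MAE-style losses can still be trained with gradient descent. The values are toy inputs for illustration.

    import torch

    # d|x|/dx is +1 for x > 0, -1 for x < 0, and undefined at x = 0;
    # autograd returns 0 at x = 0, which is a valid subgradient.
    for val in [2.0, 0.0, -3.0]:
        x = torch.tensor(val, requires_grad=True)
        loss = torch.abs(x)          # single-point MAE-style loss
        loss.backward()
        print(val, x.grad.item())    # 1.0, 0.0, -1.0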

Methodology for Designing New Loss Functions

  1. Existing loss function modification/combination:
    • Weight addition: Assigns greater weights to specific data points, classes, or outputs (e.g., Weighted Cross-Entropy, Focal Loss; a minimal Focal Loss sketch follows this list).
    • Regularization term addition: Adds a regularization term to limit the model’s complexity or encourage certain properties (e.g., L1 regularization, L2 regularization, Elastic Net). Regularization terms can also be added for output smoothness.
    • Combination of multiple loss functions: Combines multiple existing loss functions through linear combination (weighted sum) or other methods (e.g., Multi-task learning).
    • Soft/Hard Label Smoothing: Prevents the model from being too confident in its predictions using Label Smoothing Regularization.
  2. Probabilistic modeling-based design:
    • Maximum likelihood estimation (MLE): Designs loss functions by assuming a data distribution and estimating its parameters (e.g., MSE is MLE under Gaussian distribution assumption, Cross-Entropy is MLE under Bernoulli/multinomial distribution assumption).
    • Variational Inference: Uses variational inference methods to design loss functions that approximate intractable posterior distributions (e.g., ELBO, Evidence Lower Bound) (e.g., Variational Autoencoder).
    • Implicit Likelihood: Uses likelihood-free methods (e.g., GAN) when explicitly calculating the likelihood is difficult in generative models.
  3. Problem-specific loss function design:
    • Ranking Loss: Designs loss functions suitable for ranking problems (e.g., pairwise ranking loss, listwise ranking loss, margin ranking loss).
    • Object Detection Loss: Designs loss functions that consider both bounding box regression and class classification in object detection problems (e.g., YOLO, SSD, Faster R-CNN’s loss function).
    • Segmentation Loss: Designs loss functions that predict each pixel’s class and minimize the difference with the ground truth segmentation map in image segmentation problems (e.g., Dice Loss, IoU Loss, Tversky Loss).
    • Generative Model Loss: Uses loss functions for generators and discriminators in generative models like GAN and VAE (e.g., Wasserstein distance, Adversarial Loss).
    • Metric Learning Loss: Includes Contrastive Loss, Triplet Loss, N-pair Loss, etc.
    • Sequence Loss: Includes CTC Loss, sequence-to-sequence model’s Cross-Entropy, etc.
    • Graph Data Loss: Uses loss functions in Graph Neural Networks (e.g., node classification, link prediction, graph classification, etc.)
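
To make the weight-addition strategy concrete, here is a minimal sketch of a binary Focal Loss, which down-weights easy, well-classified examples so training focuses on hard ones. The function name, the uniform use of \(\alpha\), and the toy inputs are simplifications for illustration (the original formulation applies \(\alpha\) per class).

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
        # Standard BCE per example, kept unreduced so we can reweight it
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-bce)                  # probability of the true class
        # (1 - p_t)^gamma is near 0 for easy examples, near 1 for hard ones
        return (alpha * (1 - p_t) ** gamma * bce).mean()

    logits = torch.tensor([3.0, -2.0, 0.1])    # toy predictions
    targets = torch.tensor([1.0, 0.0, 1.0])    # toy labels
    print(focal_loss(logits, targets))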

Precautions when designing new loss functions

  • Excessive complexity: A loss function that is too complex can make learning difficult and cause overfitting. It’s best to start with a simple loss function and gradually increase the complexity.
  • Hyperparameter tuning: New loss functions often include additional hyperparameters (e.g., \(\gamma\) in Focal Loss, the weights used when combining multiple loss terms). It is crucial to tune these hyperparameters properly and find good values through methods like cross-validation.
  • Theoretical/empirical justification: When proposing a new loss function, it’s essential to provide theoretical justification (e.g., mathematical characteristics of a specific problem, relationship with MLE) or empirical evidence (e.g., experimental results) explaining why this loss function works well.

Designing a new loss function is a creative process, but it also requires a careful approach. It’s vital to deeply understand the essence of the problem, design based on mathematical/statistical principles, and thoroughly verify performance through experiments.

This chapter has examined the mathematical foundations of deep learning. It explored how concepts from various fields such as linear algebra, calculus, probability and statistics, and information theory are used in the design, training, and analysis of deep learning models. These mathematical tools are essential for understanding complex neural network structures, developing efficient learning algorithms, evaluating and improving model performance, and also play a crucial role in finding new breakthroughs at the forefront of deep learning research.

Exercises

1. Linear Algebra

Basic

  1. Calculate the dot product of two vectors \(\mathbf{a} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\) and \(\mathbf{b} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}\).

  2. Calculate the product \(\mathbf{Ab}\) of matrix \(\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\) and vector \(\mathbf{b} = \begin{bmatrix} 5 \\ 6 \end{bmatrix}\).

  3. Create a 2x2 identity matrix.

  4. Write the definition of L1 norm and L2 norm of a vector, and calculate the L1 norm and L2 norm of vector \(\mathbf{v} = \begin{bmatrix} 3 \\ -4 \end{bmatrix}\).

Application

  1. Find the eigenvalue and eigenvector of matrix \(\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\).

  2. Determine if the inverse of the given matrix exists, and if it does, calculate the inverse. \(\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\)

  3. Given a linear transformation \(T(\mathbf{x}) = \mathbf{Ax}\), explain how the basis vectors \(\mathbf{e_1} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\) and \(\mathbf{e_2} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}\) are transformed, and visualize the result. (Assume \(\mathbf{A} = \begin{bmatrix} 2 & -1 \\ 1 & 1 \end{bmatrix}\))

  4. Calculate the rank of the following matrix. \(\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}\)

Advanced

  1. Write the definition of Singular Value Decomposition (SVD) and decompose the given matrix \(\mathbf{A}\) using SVD. \(\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}\)

  2. Explain the purpose and process of Principal Component Analysis (PCA), and perform PCA on the given dataset to reduce its dimension to 1.

    import numpy as np
    data = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
  3. Find the basis of the null space and column space of the following matrix. \(\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}\)

  4. Write the definition of QR decomposition and perform QR decomposition on the given matrix \(\mathbf{A}\). (QR decomposition is a numerically stable method used to solve linear equations or eigenvalue problems.) \(\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\)

Exercise Answers

1. Linear Algebra

Basic

  1. Dot Product Calculation: \(\mathbf{a} \cdot \mathbf{b} = (1)(3) + (2)(4) = 3 + 8 = 11\)

  2. Matrix-Vector Multiplication: \(\mathbf{Ab} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 \\ 6 \end{bmatrix} = \begin{bmatrix} (1)(5) + (2)(6) \\ (3)(5) + (4)(6) \end{bmatrix} = \begin{bmatrix} 17 \\ 39 \end{bmatrix}\)

  3. 2x2 Identity Matrix: \(\mathbf{I} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\)

  4. L1, L2 Norm:

    • L1 Norm (Manhattan Distance): \(||\mathbf{v}||_1 = \sum_{i} |v_i|\)
    • L2 Norm (Euclidean Distance): \(||\mathbf{v}||_2 = \sqrt{\sum_{i} v_i^2}\)

    \(\mathbf{v} = \begin{bmatrix} 3 \\ -4 \end{bmatrix}\) \(||\mathbf{v}||_1 = |3| + |-4| = 3 + 4 = 7\) \(||\mathbf{v}||_2 = \sqrt{(3)^2 + (-4)^2} = \sqrt{9 + 16} = \sqrt{25} = 5\)

Application

  1. Eigenvalue, Eigenvector: \(\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\)

    • Characteristic Equation: \(\det(\mathbf{A} - \lambda\mathbf{I}) = 0\) \((2-\lambda)^2 - (1)(1) = 0\) \(\lambda^2 - 4\lambda + 3 = 0\) \((\lambda - 3)(\lambda - 1) = 0\) \(\lambda_1 = 3\), \(\lambda_2 = 1\)

    • Eigenvector (λ = 3): \((\mathbf{A} - 3\mathbf{I})\mathbf{v} = 0\) \(\begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\) \(x = y\), \(\mathbf{v_1} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\) (or any scalar multiple)

    • Eigenvector (λ = 1): \((\mathbf{A} - \mathbf{I})\mathbf{v} = 0\) \(\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\) \(x = -y\), \(\mathbf{v_2} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}\) (or any scalar multiple)

  2. Inverse Matrix: \(\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\)

  • Existence of Inverse: \(\det(\mathbf{A}) = (1)(4) - (2)(3) = 4 - 6 = -2 \neq 0\). The inverse exists.
  • Inverse Calculation: \(\mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})} \begin{bmatrix} 4 & -2 \\ -3 & 1 \end{bmatrix} = \frac{1}{-2} \begin{bmatrix} 4 & -2 \\ -3 & 1 \end{bmatrix} = \begin{bmatrix} -2 & 1 \\ 1.5 & -0.5 \end{bmatrix}\)
  3. Linear Transformation Visualization:
    • \(T(\mathbf{e_1}) = \mathbf{A}\mathbf{e_1} = \begin{bmatrix} 2 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}\)
    • \(T(\mathbf{e_2}) = \mathbf{A}\mathbf{e_2} = \begin{bmatrix} 2 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}\)
    • Visualization: The original basis vectors \(\mathbf{e_1}\), \(\mathbf{e_2}\) are transformed into \(\begin{bmatrix} 2 \\ 1 \end{bmatrix}\), \(\begin{bmatrix} -1 \\ 1 \end{bmatrix}\), respectively, which is plotted on the coordinate plane.
  4. Rank Calculation: \(\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}\) When transformed into row echelon form, two rows are non-zero, so the rank is 2. (The third row is a linear combination of the first two.) These Application results can be checked numerically with NumPy, as in the sketch below.
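
A short NumPy sketch (illustrative, using the matrices from the problems) to verify the eigenvalue, inverse, and rank answers above:

    import numpy as np

    A = np.array([[2, 1], [1, 2]])
    eigvals, eigvecs = np.linalg.eig(A)
    print(eigvals)                        # [3. 1.]

    B = np.array([[1, 2], [3, 4]])
    print(np.linalg.det(B))               # -2.0, so the inverse exists
    print(np.linalg.inv(B))               # [[-2.   1. ] [ 1.5 -0.5]]

    C = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    print(np.linalg.matrix_rank(C))       # 2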

Advanced

  1. SVD: \(\mathbf{A} = \mathbf{U\Sigma V^T}\)

    • \(\mathbf{U}\): An orthogonal matrix with columns being the eigenvectors of \(\mathbf{A}\mathbf{A}^T\)
    • \(\mathbf{\Sigma}\): A diagonal matrix with singular values (square roots of the eigenvalues of \(\mathbf{A}\mathbf{A}^T\)) as its diagonal elements
    • \(\mathbf{V}\): An orthogonal matrix with columns being the eigenvectors of \(\mathbf{A}^T\mathbf{A}\)

    (The calculation process is omitted. It can be computed with libraries like NumPy: U, S, Vt = np.linalg.svd(A) — note that NumPy returns \(\mathbf{V}^T\), not \(\mathbf{V}\). See the verification sketch after this solution set.)

  2. PCA:

    import numpy as np
    
    data = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
    
    # 1. Data centralization (subtracting the mean)
    mean = np.mean(data, axis=0)
    centered_data = data - mean
    
    # 2. Covariance matrix calculation
    covariance_matrix = np.cov(centered_data.T)
    
    # 3. Eigenvalue and eigenvector calculation
    eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)
    
    # 4. Principal component selection (eigenvector corresponding to the largest eigenvalue)
    #    Sort the eigenvalues in descending order and select the eigenvector corresponding to the largest eigenvalue
    sorted_indices = np.argsort(eigenvalues)[::-1]  # Indices of descending order
    largest_eigenvector = eigenvectors[:, sorted_indices[0]]
    
    # 5. Projection onto one dimension
    projected_data = centered_data.dot(largest_eigenvector)
    print(projected_data)


  3. Null Space and Column Space Basis: \(\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}\)

    • Null Space: Find \(\mathbf{x}\) satisfying \(\mathbf{Ax} = 0\). Transforming to row echelon form and solving gives \(\mathbf{x} = t\begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}\) (where \(t\) is an arbitrary constant). Therefore, the basis of the null space is \(\begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}\).
    • Column Space: The space spanned by linear combinations of the column vectors of \(\mathbf{A}\). The original columns corresponding to the pivot columns of the row echelon form give a basis: \(\begin{bmatrix} 1 \\ 4 \\ 7 \end{bmatrix}\), \(\begin{bmatrix} 2 \\ 5 \\ 8 \end{bmatrix}\)

  4. QR Decomposition: \(\mathbf{A} = \mathbf{QR}\)

    • \(\mathbf{Q}\): A matrix with orthonormal column vectors
    • \(\mathbf{R}\): An upper triangular matrix

    (The calculation uses the Gram-Schmidt orthogonalization process, or libraries like NumPy: Q, R = np.linalg.qr(A). Both SVD and QR are checked in the sketch below.)
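
A brief NumPy verification sketch for the SVD and QR answers, using the matrices given in the problems:

    import numpy as np

    A = np.array([[1, 2], [3, 4], [5, 6]])
    U, S, Vt = np.linalg.svd(A)           # NumPy returns V transposed
    print(S)                              # singular values
    print(np.allclose(U[:, :2] @ np.diag(S) @ Vt, A))   # True

    B = np.array([[1, 2], [3, 4]])
    Q, R = np.linalg.qr(B)
    print(np.allclose(Q @ Q.T, np.eye(2)))   # Q is orthogonal: True
    print(np.allclose(Q @ R, B))             # True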

Practice Problems

2 Calculus and Optimization

Basic

  1. Find the derivative \(f'(x)\) of the function \(f(x) = x^3 - 6x^2 + 9x + 1\).

  2. Find the partial derivatives \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\) of the function \(f(x, y) = x^2y + 2xy^2\).

  3. Find the derivative \(f'(x)\) of the function \(f(x) = \sin(x^2)\) using the chain rule.

Application

  1. Find the gradient \(\nabla f\) of the function \(f(x, y) = e^{x^2 + y^2}\) and calculate its value at the point (1, 1).

  2. Find all critical points of the function \(f(x) = x^4 - 4x^3 + 4x^2\), and determine whether each critical point is a maximum, minimum, or saddle point.

  3. Find the Jacobian matrix of the function \(f(x, y) = \begin{bmatrix} x^2 + y^2 \\ 2xy \end{bmatrix}\).

Advanced

  1. Use the Lagrange multiplier method to find the maximum and minimum values of the function \(f(x, y) = xy\) subject to the constraint \(g(x, y) = x^2 + y^2 - 1 = 0\).

  2. Use gradient descent to find the minimum of the function \(f(x) = x^4 - 4x^3 + 4x^2\), with initial value \(x_0 = 3\), learning rate \(\alpha = 0.01\), and 100 iterations.

  3. Express the gradient \(\nabla f\) of the function \(f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}\) in terms of \(\mathbf{A}\) and \(\mathbf{x}\), where \(\mathbf{A}\) is a symmetric matrix.

  4. Use Newton's method to find a root of the equation \(x^3 - 2x - 5 = 0\).

Exercise Solutions

2 Calculus and Optimization

Basic

  1. Derivative: \(f(x) = x^3 - 6x^2 + 9x + 1\), \(f'(x) = 3x^2 - 12x + 9\)

  2. Partial Derivatives: \(f(x, y) = x^2y + 2xy^2\), \(\frac{\partial f}{\partial x} = 2xy + 2y^2\), \(\frac{\partial f}{\partial y} = x^2 + 4xy\)

  3. Chain Rule: \(f(x) = \sin(x^2)\), \(f'(x) = \cos(x^2) \cdot (2x) = 2x\cos(x^2)\)

Application

  1. Gradient: \(f(x, y) = e^{x^2 + y^2}\) \(\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix} = \begin{bmatrix} 2xe^{x^2 + y^2} \\ 2ye^{x^2 + y^2} \end{bmatrix}\) \(\nabla f(1, 1) = \begin{bmatrix} 2e^2 \\ 2e^2 \end{bmatrix}\)

  2. Critical Points, Extremum Test: \(f(x) = x^4 - 4x^3 + 4x^2\) \(f'(x) = 4x^3 - 12x^2 + 8x = 4x(x-1)(x-2)\) Critical points: \(x = 0, 1, 2\)

    \(f''(x) = 12x^2 - 24x + 8\)
    • \(f''(0) = 8 > 0\): Local minimum
    • \(f''(1) = -4 < 0\): Local maximum
    • \(f''(2) = 8 > 0\): Local minimum

  3. Jacobian Matrix: \(f(x, y) = \begin{bmatrix} x^2 + y^2 \\ 2xy \end{bmatrix}\) \(\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} \\ \frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y} \end{bmatrix} = \begin{bmatrix} 2x & 2y \\ 2y & 2x \end{bmatrix}\)

Advanced

  1. Lagrange Multiplier Method: \(L(x, y, \lambda) = xy - \lambda(x^2 + y^2 - 1)\) \(\frac{\partial L}{\partial x} = y - 2\lambda x = 0\) \(\frac{\partial L}{\partial y} = x - 2\lambda y = 0\) \(\frac{\partial L}{\partial \lambda} = x^2 + y^2 - 1 = 0\)

    • \(x = \pm \frac{1}{\sqrt{2}}\), \(y = \pm \frac{1}{\sqrt{2}}\), \(\lambda = \pm \frac{1}{2}\)
    • Maximum: \(f(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}) = f(-\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}) = \frac{1}{2}\)
    • Minimum: \(f(\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}) = f(-\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}) = -\frac{1}{2}\)

  2. Gradient Descent:

    def gradient_descent(f, df, x0, alpha, iterations):
        x = x0
        for i in range(iterations):
            x = x - alpha * df(x)
        return x

    f = lambda x: x**4 - 4*x**3 + 4*x**2
    df = lambda x: 4*x**3 - 12*x**2 + 8*x

    x_min = gradient_descent(f, df, 3, 0.01, 100)
    print(x_min)  # converges to approximately 2
  3. Gradient (matrix form): \(f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}\), \(\nabla f = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\). Since \(\mathbf{A}\) is a symmetric matrix, \(\nabla f = 2\mathbf{A}\mathbf{x}\) (checked numerically with autograd in the sketch after this list).

  4. Newton's Method: \(f(x) = x^3 - 2x - 5\) \(f'(x) = 3x^2 - 2\) \(x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\)

    def newton_method(f, df, x0, iterations):
      x = x0
      for i in range(iterations):
          x = x - f(x) / df(x)
      return x
    
    f = lambda x: x**3 - 2*x - 5
    df = lambda x: 3*x**2 - 2
    
    root = newton_method(f, df, 2, 5) # initial value x0 = 2, 5 iterations
    print(root)
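
A short check (illustrative, with an assumed symmetric matrix and point) that autograd reproduces \(\nabla f = 2\mathbf{A}\mathbf{x}\) for \(f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}\):

    import torch

    A = torch.tensor([[2.0, 1.0], [1.0, 3.0]])   # symmetric matrix (assumed)
    x = torch.tensor([1.0, -2.0], requires_grad=True)

    f = x @ A @ x        # quadratic form x^T A x
    f.backward()

    print(x.grad)                  # gradient from autograd
    print((2 * A @ x).detach())    # analytic gradient 2 A x (matches)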

Practice Problems

3 Probability and Statistics

Basic

  1. Calculate the probability of getting heads twice when a coin is flipped three times.

  2. Calculate the probability of getting an even number when rolling a die.

  3. Write down the probability density function (PDF) of a normal distribution and explain the meaning of its mean and variance.

Application

  1. Explain Bayes’ theorem and apply it to the following problem:

    • A certain disease has a prevalence of 1% and a diagnostic test for this disease has an accuracy (sensitivity and specificity) of 99%. What is the probability that a person actually has the disease when the test result comes out positive?
  2. Explain the concept of Maximum Likelihood Estimation (MLE) and calculate the MLE of the probability of getting heads in a coin flip, given that the coin was flipped five times and got heads three times.

  3. Write down the definition of expectation and the formulas for calculating expectations of discrete and continuous random variables, respectively.

Advanced

  1. Write down the definition of entropy and calculate the entropy of the following probability distribution:

    • P(X=1) = 0.5, P(X=2) = 0.25, P(X=3) = 0.25
  2. Given the joint probability distribution of two random variables X and Y as follows, calculate the mutual information I(X;Y):

    P(X=0, Y=0) = 0.1, P(X=0, Y=1) = 0.2
    P(X=1, Y=0) = 0.3, P(X=1, Y=1) = 0.4
  3. Given two probability distributions P and Q as follows, calculate the Kullback-Leibler divergence \(D_{KL}(P||Q)\):

    • P(X=1) = 0.6, P(X=2) = 0.4
    • Q(X=1) = 0.8, Q(X=2) = 0.2
  4. Write down the probability mass function (PMF) of a Poisson distribution and explain with an example when it is used.

Exercise Solutions

3 Probability and Statistics

Basic

  1. Coin Toss: Probability = (number of ways heads appears twice in three tosses) × (probability of heads)^2 × (probability of tails)^1 = \(\binom{3}{2} \cdot (1/2)^2 \cdot (1/2)^1 = 3 \cdot (1/4) \cdot (1/2) = 3/8\)

  2. Dice Rolling: Probability = (number of cases where an even number appears) / (total number of cases) = 3 / 6 = 1/2

  3. Normal Distribution: \(f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\)

    • \(\mu\): mean (the center of the distribution)
    • \(\sigma\): standard deviation (the spread of the distribution); the variance is \(\sigma^2\)

Application

  1. Bayes’ Theorem: \(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\)

    • \(P(A)\): probability of having a disease (prior probability) = 0.01
    • \(P(B|A)\): probability that the test result is positive when the disease is present (sensitivity) = 0.99
    • \(P(B|\neg A)\): probability that the test result is positive when the disease is not present (1 - specificity) = 0.01 (assuming a specificity of 0.99)
    • \(P(B)\): probability of a positive test result = \(P(B|A)P(A) + P(B|\neg A)P(\neg A) = (0.99)(0.01) + (0.01)(0.99) = 0.0198\)

    \(P(A|B) = \frac{(0.99)(0.01)}{0.0198} = 0.5\) (50%)

  2. Maximum Likelihood Estimation (MLE):

  • Likelihood function: \(L(p) = p^3 (1-p)^2\) (where \(p\) is the probability of heads)
  • log-likelihood function: \(\log L(p) = 3\log p + 2\log(1-p)\)
  • \(\frac{d}{dp} \log L(p) = \frac{3}{p} - \frac{2}{1-p} = 0\)
  • \(3(1-p) - 2p = 0\)
  • \(3 - 5p = 0\)
  • \(p = \frac{3}{5} = 0.6\) (verified numerically in the sketch after this list)
  3. Expectation:
    • Definition: the probability-weighted average of the possible values of a random variable
    • Discrete: \(E(X) = \sum_x x P(x)\)
    • Continuous: \(E(X) = \int_{-\infty}^{\infty} x f(x)\,dx\)
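
A tiny NumPy sketch confirming the MLE above by grid search over the likelihood \(L(p) = p^3(1-p)^2\):

    import numpy as np

    # Evaluate the likelihood on a fine grid of p values
    p = np.linspace(0.001, 0.999, 999)
    likelihood = p**3 * (1 - p)**2
    print(p[np.argmax(likelihood)])   # 0.6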

Advanced

  1. Entropy: \(H(P) = -\sum P(x)\log_2 P(x)\) \(H(P) = 0.5 \cdot 1 + 0.25 \cdot 2 + 0.25 \cdot 2 = 1.5\) bits

  2. Mutual Information: \(I(X;Y) = \sum_{x}\sum_{y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}\)

    • \(P(X=0) = 0.1 + 0.2 = 0.3\)
    • \(P(X=1) = 0.3 + 0.4 = 0.7\)
    • \(P(Y=0) = 0.1 + 0.3 = 0.4\)
    • \(P(Y=1) = 0.2 + 0.4 = 0.6\)

    \(I(X;Y) = (0.1)\log_2\frac{0.1}{(0.3)(0.4)} + (0.2)\log_2\frac{0.2}{(0.3)(0.6)} + (0.3)\log_2\frac{0.3}{(0.7)(0.4)} + (0.4)\log_2\frac{0.4}{(0.7)(0.6)}\) \(I(X;Y) \approx 0.0058\) bits (using base-2 logarithm)

  3. KL Divergence: \(D_{KL}(P||Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}\) \(D_{KL}(P||Q) = 0.6 \log_2 \frac{0.6}{0.8} + 0.4 \log_2 \frac{0.4}{0.2} \approx 0.151\) bits (using base-2 logarithm; the information-theory answers are verified numerically in the sketch after this list)

  4. Poisson Distribution:

    • Probability Mass Function (PMF): \(P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}\) (\(k\) is the number of occurrences, \(\lambda\) is the average number of occurrences per unit time/space)
    • Example Use Cases:
      • The number of calls to a call center during a certain period
      • The number of traffic accidents in a specific area
      • The number of typos in a book
      • The number of visitors to a website
      • Radioactive decay, genetic mutations
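
A compact NumPy sketch verifying the entropy, mutual information, and KL divergence answers above (all in base-2):

    import numpy as np

    # Entropy of P = [0.5, 0.25, 0.25]
    p = np.array([0.5, 0.25, 0.25])
    print(-(p * np.log2(p)).sum())                     # 1.5

    # Mutual information from the joint distribution of (X, Y)
    joint = np.array([[0.1, 0.2], [0.3, 0.4]])
    px = joint.sum(axis=1, keepdims=True)              # marginal P(X)
    py = joint.sum(axis=0, keepdims=True)              # marginal P(Y)
    print((joint * np.log2(joint / (px * py))).sum())  # about 0.0058

    # KL divergence D_KL(P||Q)
    P = np.array([0.6, 0.4])
    Q = np.array([0.8, 0.2])
    print((P * np.log2(P / Q)).sum())                  # about 0.151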

References

Essential References

  1. Linear Algebra and Its Applications (Gilbert Strang, 4th Edition):
  2. Calculus (James Stewart, 8th Edition):
    • This textbook provides a detailed explanation of the fundamental principles of calculus. It offers background knowledge necessary for understanding optimization algorithms in deep learning.
  3. Probability and Statistics for Engineering and the Sciences (Jay L. Devore, 9th Edition):
    • This textbook explains basic probability and statistics concepts along with engineering applications. It helps with understanding probabilistic modeling and uncertainty inference in deep learning.
  4. Pattern Recognition and Machine Learning (Christopher Bishop):
    • This is a classic textbook on pattern recognition and machine learning. It covers theoretical backgrounds of deep learning, including probabilistic modeling, Bayesian inference, and information theory, in depth.
  5. The Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman):
    • This textbook clearly explains core concepts of statistical learning theory. It is useful for understanding the generalization performance and overfitting issues of deep learning models.
    • The Elements of Statistical Learning (Free PDF)
  6. Deep Learning (Ian Goodfellow, Yoshua Bengio, Aaron Courville):
    • This textbook comprehensively covers the basic concepts and latest technologies of deep learning. It briefly introduces the mathematical foundations necessary for deep learning.
    • Deep Learning Book (Free PDF)
  7. Understanding Machine Learning: From Theory to Algorithms (Shai Shalev-Shwartz, Shai Ben-David):
    • This textbook provides a solid foundation for the theoretical aspects of machine learning. It explains important concepts such as PAC learning theory, VC dimension, and bias-variance tradeoff that are crucial for understanding the generalization performance of deep learning models.
  8. Information Theory, Inference, and Learning Algorithms (David J.C. MacKay):
  9. Mathematics for Machine Learning (Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong)
    • This textbook broadly covers the mathematical background knowledge necessary for machine learning.
  10. Matrix Computations (Gene H. Golub, Charles F. Van Loan, 4th Edition):
    • This textbook provides an in-depth treatment of numerical methods related to matrix computations. It offers knowledge necessary for implementing and improving optimization algorithms in deep learning.